All about Linear Regression using Scikit

In practice, supervised machine learning covers two primary kinds of task: classification and regression. Classification is used to predict discrete outputs, while regression is used to predict continuous values.

In algebra, linearity denotes a straight-line relationship between variables. Plotted on a graph, such a relationship appears as a straight line.

Linear regression is a supervised machine learning algorithm. It looks for the line that best fits the data points on a plot. In its simplest form, it estimates the relationship between one dependent and one independent variable with a straight line.

A linear regression model finds the linear relationship between the independent variables and the dependent variable that has the lowest cost, that is, the smallest total prediction error on the given data (commonly measured as the sum of squared errors).
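To make the idea of "lowest cost" concrete, here is a minimal sketch that scores two candidate lines by mean squared error, one common choice of cost. The data points and the helper name mean_squared_error_cost are invented for illustration and do not come from the article's datasets.

```python
import numpy as np

# Toy data, invented for illustration: y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def mean_squared_error_cost(intercept, slope):
    """Average squared gap between the line's predictions and the observed y."""
    predictions = intercept + slope * x
    return np.mean((y - predictions) ** 2)

good_cost = mean_squared_error_cost(1.0, 2.0)  # line close to the data
poor_cost = mean_squared_error_cost(0.0, 1.0)  # line that is too shallow

print(f"cost of y = 1 + 2x: {good_cost:.3f}")
print(f"cost of y = 0 + 1x: {poor_cost:.3f}")
```

The line with the smaller cost is the better fit; fitting a regression amounts to searching for the intercept and slope that make this number as small as possible.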

In mathematics, a linear regression model is commonly written in three equivalent ways (y being the dependent variable and x the independent variable):

y = intercept + (slope × x) + error
y = constant + (coefficient × x) + error
y = a + bx + e
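As a rough illustration of y = a + bx + e, the sketch below estimates a and b with the standard least-squares formulas and then computes the error term e for each point. The numbers are made up for the example.

```python
import numpy as np

# Invented data that lies close to the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares estimates:
#   slope b     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The error term e is whatever the straight line does not explain
e = y - (a + b * x)

print("a (intercept):", a)
print("b (slope):", b)
print("e (errors):", e)
```

With a least-squares fit that includes an intercept, the errors sum to (numerically) zero, which is a quick sanity check on the calculation.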

Why is linear regression essential?

Linear regression models are comparatively simple and easy to use. They provide an interpretable mathematical formula for generating predictions. Linear regression can be useful in various fields (for instance, academia or business studies).

Linear regression is a well-established, statistically grounded method for making predictions, although its accuracy depends on how well the data meet its assumptions. It is used across many sciences, including environmental, behavioural and social sciences.

Because linear regression is a long-established statistical procedure, the properties of these models are well understood and they are quick to train. It also helps turn large raw data sets into actionable information.

Key assumptions of effective linear regression

For each variable: The number of valid cases, mean, and standard deviation should be considered.
For each model: Regression coefficients, correlation matrix, part and partial correlations, standard error of the estimate, analysis-of-variance table, predicted values, and residuals should be considered.
Plots: Scatterplots, histograms, partial plots, and normal probability plots are considered.
Data: The dependent and independent variables must be quantitative. Categorical variables need to be re-coded to binary (dummy) variables or other types of contrast variables.
Other assumptions: For every value of a given independent variable, the distribution of the dependent variable should be normal. The variance of that distribution should be constant for every value of the independent variable. The relationship between the dependent variable and each independent variable should be linear. Plus, all observations should be independent.
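The re-coding of categorical variables mentioned above can be sketched with pandas. The column names (min_temp, season) and values here are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical data: 'season' is categorical, so it cannot enter the model as text
df = pd.DataFrame({
    "min_temp": [2.0, 18.0, 9.0, 20.0],
    "season": ["winter", "summer", "spring", "summer"],
})

# One binary (dummy) column per category; drop_first=True omits one level
# (here 'spring') so the remaining columns are not perfectly collinear
encoded = pd.get_dummies(df, columns=["season"], drop_first=True)

print(encoded.columns.tolist())
```

The omitted level acts as the baseline: each dummy coefficient is then read as the difference from that baseline category.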

Here is a worked example of simple linear regression:

The dataset in the example contains daily weather records from around the world for a particular period. It includes factors such as precipitation, snowfall, temperatures, wind speed, thunderstorms and other weather conditions.

This problem aims to use the simple linear regression model to predict the maximum temperature while taking the minimum temperature as the input.

First, import the required libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Jupyter/IPython magic: render plots inline in the notebook
%matplotlib inline

To load the dataset with pandas, use the following command:

dataset = pd.read_csv('/Users/nageshsinghchauhan/Documents/projects/ML/ML_BLOG_LInearRegression/Weather.csv')

To explore the data, first check the number of rows and columns:

dataset.shape

The output should be (119040, 31), which means the data contains 119040 rows and 31 columns.

To see the statistical details of the dataset, use describe():

dataset.describe()
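The walkthrough above stops before the model is fitted. The following is a minimal sketch of how the remaining steps might look, predicting the maximum temperature from the minimum temperature. It runs on synthetic data rather than Weather.csv, and the column names MinTemp and MaxTemp are assumptions, not names confirmed by the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data; column names are assumed
rng = np.random.default_rng(0)
min_temp = rng.uniform(-5, 25, size=200)
max_temp = 10 + 0.9 * min_temp + rng.normal(0, 2, size=200)
weather = pd.DataFrame({"MinTemp": min_temp, "MaxTemp": max_temp})

X = weather[["MinTemp"]]  # 2-D input with a single column
y = weather["MaxTemp"]    # 1-D target

# Hold out 20% of the rows to evaluate the fitted line
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
print("R^2 on test data:", model.score(X_test, y_test))
```

On the real dataset the same pattern would apply once the relevant columns are selected from dataset.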

Here is another example that demonstrates how to import and use the Python libraries needed to apply linear regression to a data set:

1. Importing all the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

2. Reading the data set

# Change the working directory to the location of the dataset (IPython magic)
cd C:\Users\Dev\Desktop\Kaggle\Salinity

df = pd.read_csv('bottle.csv')

# Keep only the two selected attributes; .copy() avoids modifying a view of df
df_binary = df[['Salnty', 'T_degC']].copy()

# Rename the columns for easier writing of the code
df_binary.columns = ['Sal', 'Temp']

# Display the first five rows along with the column names
df_binary.head()
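For readers without the file to hand, the same select-and-rename step can be tried on a small in-memory CSV. The rows below and the extra Depth column are invented; only the column names Salnty and T_degC come from the example above.

```python
import io
import pandas as pd

# A few made-up rows standing in for bottle.csv
csv_text = io.StringIO(
    "Depth,T_degC,Salnty\n"
    "0,10.5,33.44\n"
    "8,10.46,33.44\n"
    "10,10.46,33.437\n"
)

# usecols reads only the columns that are needed
sample = pd.read_csv(csv_text, usecols=["Salnty", "T_degC"])

# rename() maps old names to new ones explicitly, independent of column order
sample = sample.rename(columns={"Salnty": "Sal", "T_degC": "Temp"})

print(sample.head())
```

Renaming with a mapping rather than assigning a list to .columns is one way to avoid mislabelling columns if their order in the file ever changes.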

2. Exploring the data scatter

# Plot the data scatter; order = 2 overlays a second-order polynomial fit
sns.lmplot(x = "Sal", y = "Temp", data = df_binary, order = 2, ci = None)
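The order = 2 argument asks seaborn for a second-order polynomial fit rather than a straight line. A small sketch with NumPy shows the difference between the two on invented, gently curved data:

```python
import numpy as np

# Invented data that follows a curve, not a straight line
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + 0.3 * x**2

line = np.polyfit(x, y, deg=1)   # coefficients: [slope, intercept]
curve = np.polyfit(x, y, deg=2)  # coefficients: [x^2 term, x term, intercept]

# Mean squared error of each fit against the data
line_error = np.mean((y - np.polyval(line, x)) ** 2)
curve_error = np.mean((y - np.polyval(curve, x)) ** 2)

print("straight-line error:", line_error)
print("order-2 error:", curve_error)
```

If the scatter looks curved, the order-2 overlay will hug the points more closely than the straight line that LinearRegression fits later on a single untransformed feature.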

3. Data cleaning

# Fill missing (NaN) values with the last valid value above them (forward fill)
df_binary.ffill(inplace = True)
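Forward filling and dropping rows are two different ways to handle missing values, and they leave different data behind. This sketch compares them on a tiny invented frame:

```python
import numpy as np
import pandas as pd

# Invented values with gaps in both columns
toy = pd.DataFrame({
    "Sal": [33.4, np.nan, 33.6, np.nan],
    "Temp": [10.5, 10.4, np.nan, 10.1],
})

filled = toy.ffill()    # each gap takes the last valid value above it
dropped = toy.dropna()  # rows with any gap are removed instead

print(filled)
print(dropped)
```

Forward fill keeps every row but copies neighbouring measurements, while dropna keeps only fully observed rows. Note that forward fill cannot fill a gap in the very first row, because there is no earlier value to copy.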

4. Training the model

# Drop any rows that still contain NaN values (for example, leading gaps
# that forward fill cannot reach); this must happen before X and y are built
df_binary.dropna(inplace = True)

# Separate the data into independent (X) and dependent (y) variables,
# converting each single-column selection into a NumPy array of shape (n, 1)
X = np.array(df_binary['Sal']).reshape(-1, 1)
y = np.array(df_binary['Temp']).reshape(-1, 1)

# Split the data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

regr = LinearRegression()
regr.fit(X_train, y_train)

# Print the R^2 score of the fitted model on the test data
print(regr.score(X_test, y_test))
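Because train_test_split shuffles the rows, the printed score changes from run to run. A minimal sketch, using invented salinity-like and temperature-like values, of how a fixed random_state makes the split repeatable and how the fitted slope and intercept can be inspected:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data: temperature falls as salinity rises
rng = np.random.default_rng(42)
X = rng.uniform(30, 36, size=(300, 1))
y = 100.0 - 2.5 * X[:, 0] + rng.normal(0, 0.5, size=300)

# The same random_state yields the same split every time
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)
same_split = np.array_equal(split_a[1], split_b[1])

X_train, X_test, y_train, y_test = split_a
regr = LinearRegression().fit(X_train, y_train)

print("identical test sets:", same_split)
print("slope:", regr.coef_[0])
print("intercept:", regr.intercept_)
print("R^2 on test data:", regr.score(X_test, y_test))
```

The slope is read as the expected change in the dependent variable for a one-unit increase in the independent variable.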

5. Exploring the results

y_pred = regr.predict(X_test)

# Scatter the test data and draw the fitted regression line over it
plt.scatter(X_test, y_test, color = 'b')
plt.plot(X_test, y_pred, color = 'k')
plt.show()
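Alongside the plot, the predictions can be summarised numerically with sklearn.metrics. The sketch below uses four invented test values and predictions so the arithmetic is easy to follow by hand:

```python
import numpy as np
from sklearn import metrics

# Invented true values and predictions
y_test = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 16.0])

mae = metrics.mean_absolute_error(y_test, y_pred)  # average absolute error
mse = metrics.mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                                # same units as y
r2 = metrics.r2_score(y_test, y_pred)              # share of variance explained

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

In the walkthrough above, the y_test and y_pred arrays from the fitted model would take the place of these toy arrays.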

6. Working with a smaller dataset

# Select the first 500 rows of the data
df_binary500 = df_binary[:500]

sns.lmplot(x = "Sal", y = "Temp", data = df_binary500, order = 2, ci = None)
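Taking the first 500 rows is the simplest way to shrink a dataset, but if the file is ordered (for example by date or location) those rows may not be representative. As an alternative that the original example does not use, a random sample of the same size can be drawn. The frame below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented frame standing in for a larger dataset
frame = pd.DataFrame({"Sal": np.arange(2000, dtype=float)})

first_500 = frame.iloc[:500]                       # first 500 rows, in file order
random_500 = frame.sample(n=500, random_state=0)   # 500 rows chosen at random

print(first_500["Sal"].max())
print(random_500["Sal"].max())
```

Both subsets have 500 rows, but only the random one is spread across the whole range of the data.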

Frequently Asked Questions (FAQs)

1. What is linear regression used for?

This kind of analysis is generally used to predict the value of one variable based on another known variable. The variable being predicted is called the dependent variable, and the variable used to predict it is called the independent variable.

2. How to install scikit-learn?

There are two common options. The quickest, if it is available to you, is to install the version of scikit-learn provided by your operating system or Python distribution. Alternatively, install the latest official release, for example with pip install -U scikit-learn.
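After installing, a quick way to confirm that the library is importable and to see which version was picked up is:

```python
import sklearn

# Prints the installed scikit-learn version string
print(sklearn.__version__)
```

If the import fails, the package was most likely installed into a different Python environment from the one running the script.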

3. How does scikit-learn work?

Scikit-learn provides a range of supervised and unsupervised learning algorithms through a consistent Python interface. It is released under a permissive BSD license and is packaged by many Linux distributions, which encourages both academic and commercial use.
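The consistent interface means that a supervised model and an unsupervised one are used in much the same way: create the estimator, call fit, then call predict. A minimal sketch on invented data, with KMeans picked only as a convenient unsupervised example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Invented data: two well-separated groups of x values, with y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = 2.0 * X[:, 0] + 1.0

# Supervised: fit() takes both the inputs X and the targets y
regressor = LinearRegression().fit(X, y)

# Unsupervised: fit() takes only the inputs X
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Both expose predict() for new inputs
predicted_value = regressor.predict([[4.0]])
predicted_cluster = clusterer.predict([[4.0]])

print("predicted y for x = 4:", predicted_value[0])
print("cluster assigned to x = 4:", predicted_cluster[0])
```

The cluster labels themselves are arbitrary integers; what matters is that x = 4 lands in the same cluster as the smaller x values.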

Rohan Vats
