
All about Linear Regression using Scikit

In practice, supervised machine learning addresses two primary kinds of problems: classification and regression. Classification is used to predict discrete outputs, while regression is used to predict continuous values.

In algebra, linearity denotes a straight-line relationship between variables. Plotted on a graph, this relationship appears as a straight line.


Linear regression is a supervised machine learning algorithm. It looks for the line that best fits the data points available on a plot. In its simplest form, it is a regression model that estimates the relationship between one dependent and one independent variable with the help of a straight line.

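Linear regression models build a linear relationship between the independent variables and the dependent variable by choosing the line with the lowest cost, that is, the smallest total error between the predicted and the observed values of the dependent variable.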
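To make the idea of a "lowest cost" line concrete, here is a minimal sketch that compares two candidate lines on a tiny made-up data set. It assumes the mean squared error as the cost, which is the usual choice for ordinary least squares; the function name and the data are illustrative and not part of the original example.

```python
import numpy as np

def mean_squared_error_cost(x, y, intercept, slope):
    """Average squared vertical distance between the points and the line."""
    predictions = intercept + slope * x
    return float(np.mean((y - predictions) ** 2))

# Illustrative data that lies exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# The line that matches the data has zero cost
good_cost = mean_squared_error_cost(x, y, intercept=1.0, slope=2.0)

# A poorly chosen line has a higher cost
poor_cost = mean_squared_error_cost(x, y, intercept=0.0, slope=1.0)

print(good_cost, poor_cost)
```

Fitting a linear regression amounts to searching for the intercept and slope that make this cost as small as possible.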

In mathematics, a simple linear regression model is commonly written in three equivalent ways (y being the dependent variable and x the independent variable):

  • y = intercept + (slope × x) + error 
  • y = constant + (coefficient × x) + error 
  • y = a + bx + e
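As a rough illustration of the last form, the sketch below generates synthetic data from y = a + bx + e and then recovers a and b with the standard closed-form least-squares formulas. The true values (a = 2, b = 3), the noise level, and the random seed are assumptions chosen only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = a + b*x + e with a = 2, b = 3 and small random noise e
x = np.linspace(0, 10, 50)
e = rng.normal(0, 0.5, size=x.shape)
y = 2.0 + 3.0 * x + e

# Closed-form least-squares estimates for simple linear regression
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print("intercept a:", a)
print("slope b:", b)
```

The estimates land close to the values used to generate the data, but not exactly on them, because of the error term e.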

Why is linear regression essential?

Linear regression models are comparatively simple and easy to use. They provide a straightforward mathematical formula for generating predictions, which makes the results relatively easy to interpret. Linear regression can be instrumental in various fields (for instance, academics or business studies).

Linear regression is a well-established and widely used method for making predictions, although its accuracy depends on how well the data meet its assumptions. It is used in various sciences, including the environmental, behavioural, and social sciences.

Because linear regression is a long-established statistical procedure, the properties of these models are very well understood, and the models are easy to train. It also facilitates the transformation of copious raw data sets into actionable information.

Key assumptions of effective linear regression

  • Statistics: For each variable, the number of valid cases, mean, and standard deviation should be considered. 
  • For each model: Regression coefficients, correlation matrix, part and partial correlations, standard error of the estimate, analysis-of-variance table, predicted values, and residuals should be considered.
  • Plots: Scatterplots, histograms, partial plots, and normal probability plots should be considered.
  • Data: It must be ensured that dependent and independent variables are quantitative. Categorical variables need to be re-coded to binary (dummy) variables or other types of contrast variables.
  • Other assumptions: For every value of a given independent variable, the dependent variable should be normally distributed. The variance of that distribution should be constant for every value of the independent variable. The relationship between the dependent variable and each independent variable should be linear. Plus, all observations should be independent.
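Some of these assumptions can be checked informally by looking at the residuals of a fitted model. The following is only a rough sketch on synthetic data: the data, the seed, and the split of x into a lower and an upper half are assumptions made for illustration, and this is not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data with a linear relationship and constant-variance noise
X = rng.uniform(0, 10, size=(200, 1))
y = 1.5 + 0.8 * X[:, 0] + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# With an intercept in the model, least-squares residuals average to zero
mean_residual = residuals.mean()

# Rough constant-variance check: compare the residual spread
# for small and large values of the independent variable
lower_spread = residuals[X[:, 0] < 5].std()
upper_spread = residuals[X[:, 0] >= 5].std()

print("mean residual:", mean_residual)
print("spread (x < 5):", lower_spread)
print("spread (x >= 5):", upper_spread)
```

If the two spreads differ greatly, or if a scatterplot of the residuals shows a clear pattern, the constant-variance or linearity assumption may not hold.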

Here is an example of a simple linear regression:

The dataset in the example contains information on the weather conditions recorded each day around the world over a particular period. It includes factors like precipitation, snowfall, temperatures, wind speed, thunderstorms, and other weather conditions. 

The aim is to use a simple linear regression model to predict the maximum temperature, taking the minimum temperature as the input. 

First, all the required libraries need to be imported.

import pandas as pd  

import numpy as np  

import matplotlib.pyplot as plt  

import seaborn as seabornInstance 

from sklearn.model_selection import train_test_split 

from sklearn.linear_model import LinearRegression

from sklearn import metrics

%matplotlib inline

To load the dataset using pandas, the following command needs to be applied:

dataset = pd.read_csv('/Users/nageshsinghchauhan/Documents/projects/ML/ML_BLOG_LInearRegression/Weather.csv')

To start exploring the data, check the number of rows and columns in the dataset with the following command: 

dataset.shape

The output received should be (119040, 31), which means the data contains 119040 rows and 31 columns.

To see the statistical details of the dataset, describe() can be used:

dataset.describe()
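The remaining steps of this example, fitting the model and predicting the maximum temperature from the minimum temperature, could look roughly like the sketch below. Because the CSV file is not available here, a small synthetic data frame stands in for it. The column names MinTemp and MaxTemp, the synthetic values, and the 80/20 split are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather data (hypothetical column names)
min_temp = rng.uniform(-5, 25, size=300)
dataset = pd.DataFrame({
    "MinTemp": min_temp,
    "MaxTemp": min_temp + 10 + rng.normal(0, 2, size=300),
})

# Minimum temperature is the input, maximum temperature is the target
X = dataset[["MinTemp"]]
y = dataset["MaxTemp"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print("Intercept:", regressor.intercept_)
print("Slope:", regressor.coef_[0])
print("Mean absolute error:", metrics.mean_absolute_error(y_test, y_pred))
```

With the real file, the synthetic data frame would be replaced by the one loaded with pd.read_csv, and the column names would need to match those in the file.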

Here is another example, which demonstrates how various Python libraries can be used to apply linear regression to a given data set:

1. Importing all the required libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import preprocessing, svm

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

2. Reading the data set and exploring the data scatter

cd C:\Users\Dev\Desktop\Kaggle\Salinity

# Changing the working directory to the location of the dataset (an IPython/Jupyter command)

df = pd.read_csv('bottle.csv')

df_binary = df[['Salnty', 'T_degC']].copy()

# Taking a copy of only the selected two attributes from the dataset

df_binary.columns = ['Sal', 'Temp']

# Renaming the columns for easier writing of the code

df_binary.head()

# Displaying only the first five rows along with the column names

sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)

# Plotting the data scatter

3. Data cleaning

# Filling NaN or missing values with the previous valid value (forward fill)

df_binary.ffill(inplace=True)
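To see what forward filling does and why rows with NaN values may still need to be dropped afterwards, consider this small sketch. The tiny data frame is made up for illustration and only borrows the column names used above.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values, including one in the very first row
demo = pd.DataFrame({
    "Sal": [np.nan, 33.4, np.nan, 33.9],
    "Temp": [10.5, np.nan, 10.1, 9.8],
})

# Forward fill copies the last valid value downwards
filled = demo.ffill()

# A NaN in the first row has no earlier value to copy, so it survives
remaining = int(filled.isna().sum().sum())

# Dropping the rows that still contain NaN values
cleaned = filled.dropna()

print(filled)
print("NaN values left after forward fill:", remaining)
print("rows after dropna:", len(cleaned))
```

Forward fill is a simple choice; whether it is appropriate depends on whether neighbouring rows of the data are actually related.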

4. Training the model

df_binary.dropna(inplace=True)

# Dropping any rows that still contain NaN values, before the arrays are built

X = np.array(df_binary['Sal']).reshape(-1, 1)

y = np.array(df_binary['Temp']).reshape(-1, 1)

# Separating the data into independent and dependent variables

# Converting each column into a two-dimensional numpy array,

# which is the input shape scikit-learn expects

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
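Note that train_test_split shuffles the rows randomly, so the score printed later can change from run to run. One common way to make the split repeatable is to pass a random_state, as in this small sketch on made-up data (the value 42 is arbitrary).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows with a single feature
X = np.arange(100).reshape(-1, 1)
y = 2 * X

# test_size=0.25 keeps 75% of the rows for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The same random_state reproduces exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```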

# Splitting the data into training and testing data

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))
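For a regressor, score returns the coefficient of determination R², where 1.0 means a perfect fit. The sketch below, on synthetic data chosen only for illustration, computes R² by hand and compares it with the value reported by score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data shaped like the arrays in the example: (n_samples, 1)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 - 0.5 * X + rng.normal(0, 0.3, size=(100, 1))

regr = LinearRegression().fit(X, y)
y_pred = regr.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(regr.score(X, y))
```

Both numbers agree, which confirms that score measures the share of the variation in y that the fitted line explains.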

5. Exploring the results

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color='b')

plt.plot(X_test, y_pred, color='k')

plt.show()

# Data scatter of predicted values

6. Working with a smaller dataset

df_binary500 = df_binary[:500]

# Selecting the first 500 rows of the data

sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)
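Setting order to 2 asks seaborn to draw a second-degree polynomial through the scatter instead of a straight line, which it estimates with numpy.polyfit. The sketch below shows that kind of fit directly, on made-up data that follows an exact quadratic.

```python
import numpy as np

# Illustrative data following y = 1 + 2x + 0.5x^2 exactly
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Fit a second-degree polynomial; coefficients come back highest power first
coeffs = np.polyfit(x, y, deg=2)

# Evaluate the fitted curve at the original x values
curve = np.polyval(coeffs, x)

print(coeffs)
```

A curved fit in the plot is a hint that a straight line may not describe the relationship between the two variables very well.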



What is linear regression used for?

This kind of analysis is generally used to predict the value of one variable based on another known variable. The variable being predicted is called the dependent variable, and the variable used to predict it is called the independent variable.

How to install scikit learn?

There are two main options. The first is to install the version of scikit-learn provided by the operating system or Python distribution, which is the quickest route for people who have this option available. The second is to install the latest official release, for example with pip install -U scikit-learn.

How does scikit learn work?

Scikit-learn provides a range of supervised and unsupervised learning algorithms through a consistent Python interface. It is licensed under a permissive BSD license and is distributed with many Linux distributions, which encourages its use in both business and education.
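The consistent interface means that very different algorithms are used through the same fit and predict calls. Here is a minimal sketch with one supervised and one unsupervised estimator; the data and the choice of KMeans as the second estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Illustrative data: two well-separated groups of points on y = 2x
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])

# Supervised: fit on inputs and targets, then predict a new value
reg = LinearRegression().fit(X, y)
reg_pred = reg.predict([[4.0]])

# Unsupervised: fit on inputs only, then predict cluster labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)

print(reg_pred)
print(labels)
```

Swapping one estimator for another usually requires changing only the line that creates it.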
