
Logistic Regression Courses

Data Science is a dynamic field that equips today’s workforce to analyze data, make sense of large amounts of information, and draw meaningful insights.


Logistic Regression Course Overview

If you are new to data analytics and machine learning, you will likely want to learn the different techniques and tools related to these fields. One specific kind of analysis that data analysts widely use is logistic regression. However, before using it, you must first understand what logistic regression means.

Logistic regression is a supervised classification algorithm. In a logistic regression classification problem, the output, i.e., the target variable (y), can take only discrete values for a given set of features, i.e., inputs (x).

After understanding what logistic regression means and how it works from scratch, the next vital thing is to know when to use it. Logistic regression helps predict a categorical dependent variable. It is useful when the prediction is binary, like true or false, yes or no, 0 or 1. The predicted outcome of logistic regression is always one of these two categories - no middle value is possible.

Regarding the predictor variables, they can fall into either of the below categories:

i. Continuous:

Continuous data is classified as ratio data or interval data. You can measure it on an infinite scale, and it can take any value between two numbers. Examples include weight in grams or temperature in degrees Celsius.

ii. Discrete (ordinal):

Ordinal data fits into a specific order on a scale. An example of this category is describing how satisfied you are with a service or product on a scale of 1-5. (A variable like eye color - black, brown, or blue - is also discrete, but its categories have no inherent order, so it is called nominal rather than ordinal.)

Logistic regression analysis is significant for forecasting the probability of an event. It lets you estimate the probability that an observation falls into one of two classes. Via logistic regression, you can anticipate both a probability and a classification outcome.

A logistic regression model can assist you in categorizing data for extract, transform, and load (ETL) operations. Remember that logistic regression must not be used if the number of observations is lower than the number of features; if used, it leads to overfitting.

Logistic regression works by forecasting the “log odds” in the form of a linear equation, similar to linear regression. After you obtain this forecast, you apply a sigmoid function to the prediction to turn it into a probability. The outcomes are probability values, so they fall in the range from 0 to 1. The typical cutoff value is 0.5: values below 0.5 fall into one class, and values of 0.5 or above belong to the other class.

The logistic regression sigmoid function (also known as the logistic function) maps the forecasted values to probabilities. When plotted on a graph, the sigmoid function depicts an S-shaped curve. The graph plots the forecasted values in the range of 0 to 1; the curve flattens near the upper and lower margins of the Y-axis, which carry the labels 1 and 0. Based on these values, the target variable is classified into one of the two classes.

Here is the formula of the Sigmoid function:

y = 1/(1+e^-x)

(here, e is the exponential constant, whose value is approximately 2.718)

The above equation gives a value of y near 0 when x is a large negative value. Similarly, when x is a large positive number, the value of y is close to 1.
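To make this concrete, here is a minimal Python sketch of the sigmoid function (the function name and the test values are illustrative, not from any particular library):

```python
import math

def sigmoid(x):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs give values near 0; large positive inputs give values near 1.
for x in [-10, -2, 0, 2, 10]:
    print(f"x = {x:>3}, sigmoid(x) = {sigmoid(x):.4f}")
```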

When a decision boundary is set, it helps predict which class a given data point belongs to. Based on the set value, the estimated values are categorized into classes.

Let’s understand this with an example of categorizing emails as spam or not spam. If the predicted value is above 0.5, the email is classified as spam, and vice versa.

In logistic regression, the sigmoid function turns an advanced regression method into a tool for solving diverse classification problems. Although it is a classification model, it carries the name ‘regression’ because the fundamental technique is identical to that of linear regression.

You can understand the sigmoid function as a mathematical function with a typical ‘S’-shaped curve. This curve converts values into the range of 0 to 1, approaching both limits asymptotically without ever reaching them. Other names for this function are the logistic function and the sigmoidal curve. The function is also useful as a non-linear activation function, adding to its popularity.

The logit model, whose inverse is the sigmoid, helps to predict the probabilities of a binary outcome. Furthermore, the sigmoid function transforms a regression line into a decision boundary for binary classification with logistic regression.

Let’s understand the Sigmoid function with an example. Suppose we take a standard regression problem, such as

z = β^T x

and pass it through a sigmoid function:

σ(z) = σ(β^T x)

When you apply the Sigmoid function, you get an S curve instead of a straight line. This shape shows growth that starts slowly, accelerates, and then levels off as it approaches its maximum. This makes it convenient for binary classification with 0 and 1 as the possible output values. When the value of the linear model is 2.5, 5, or 10, the sigmoid function maps it into the class associated with 1:

σ(z) < 0.5 for z < 0

σ(z) ≥ 0.5 for z ≥ 0
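As an illustration, here is a minimal sketch of passing a linear score z = β^T x through the sigmoid; the coefficients and feature values are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([0.5, -1.2, 2.0])  # illustrative coefficients (beta)
x = np.array([1.0, 0.8, 1.5])      # illustrative feature vector

z = beta @ x        # z = beta^T x, the linear score ("log odds"); here z = 2.54
p = sigmoid(z)      # probability in (0, 1); here p is about 0.93

# sigma(z) < 0.5 whenever z < 0, and sigma(z) >= 0.5 whenever z >= 0.
print(f"z = {z:.2f}, sigmoid(z) = {p:.2f}, class = {int(p >= 0.5)}")
```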

The sigmoid function helps in credit card fraud detection using logistic regression. Let’s understand it with an example. Suppose a sigmoid function is used to classify credit card transactions as fraudulent or genuine, and for one transaction the function outputs a 70% probability that it is fraudulent. To represent this, you would write -

hβ(x) = 0.7
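Below is a minimal sketch of this idea with scikit-learn; the synthetic data, the two made-up features, and the 0.5 threshold are illustrative assumptions, not a real fraud model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, standardized features for each transaction; label 1 = fraudulent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(genuine), P(fraudulent)] for each row.
new_transaction = np.array([[1.2, -0.3]])
p_fraud = model.predict_proba(new_transaction)[0, 1]
label = "fraudulent" if p_fraud >= 0.5 else "genuine"
print(f"h(x) = {p_fraud:.2f} -> {label}")
```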

Decision Boundary:

A primary application of logistic regression is determining a decision boundary for a binary classification problem. A single threshold on the predicted probability yields a binary decision boundary; by combining several such binary classifiers, the approach also extends to multiclass classification.

In multiple logistic regression, the decision boundary is a straight line separating class A and class B. Certain points from class A may still fall into the area of class B, because with a linear model it is hard to obtain a boundary line that perfectly discriminates the two classes.

When logistic regression trains a classifier on a dataset, it must define a hyperplane, known as the ‘decision boundary’, that separates the data points into explicit classes and marks where the algorithm transitions from one class to the other.

Data points on one side of a decision boundary are predicted to belong to ‘Class A’; those on the other side are predicted to belong to Class B.

Logistic regression aims to find a way to divide the data points so that the class of any particular observation can be predicted precisely. For this, the information available in the features is used.

Now let’s take a logistic regression example to understand this. Suppose we specify a line describing a decision boundary. Every point on one side of this boundary will belong to class A, and every point on the other side will belong to class B.

Logistic regression uses the following formula:

S(z)=1/(1+e^-z)

Here, S(z) = output in range of 0 to 1 (probability estimate)

z = input to the function (z = mx + b)

e = Base of natural log

The prediction function in use returns a probability score in the range of 0 to 1. To map this score to a discrete class (A/B), you choose a threshold value, or tipping point. Any value above this threshold is categorized into class A. Any value below it is classified into class B.

p >= 0.5 for class=A

p < 0.5 for class=B

If the threshold value is 0.5 and the prediction function returns 0.7, the observation is classified into class A. If the prediction value were 0.2, the observation would be classified into class B. Hence, the line where p = 0.5 is the decision boundary.
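A minimal sketch of this threshold rule in Python, using the 0.5 cutoff and the 0.7 and 0.2 examples from the text:

```python
def classify(p, threshold=0.5):
    """Map a predicted probability to a discrete class label."""
    return "A" if p >= threshold else "B"

print(classify(0.7))  # -> A, since 0.7 >= 0.5
print(classify(0.2))  # -> B, since 0.2 <  0.5
```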

In a logistic regression model, certain assumptions help enhance its performance. Let’s go through the details of each of the assumptions:


Assumption-1: The nature of the response variable is binary

Logistic regression mandates that the response variable take only one of two possible outcomes. The corresponding examples are:

  • Yes or No
  • Pass or Fail
  • Male or Female
  • Malignant or Benign
  • Drafted or Not Drafted

Assumption-2: The observations are independent

Logistic regression presumes the observations in the dataset to be independent of each other. The observations must not come from repeated measurements of the same individual or be associated with each other in any form.

Assumption-3: No Multicollinearity between explanatory variables

Logistic regression assumes that no multicollinearity exists between the explanatory variables.

Multicollinearity takes place whenever two or more explanatory variables are highly correlated with each other, such that they do not offer unique or independent information to the regression model. If the degree of correlation among the variables is high, it leads to issues when fitting the logistic regression model and interpreting it.

The typical method to identify multicollinearity is the variance inflation factor (VIF). This factor measures the strength of the correlation among the predictor variables in a regression model.
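Here is a minimal sketch of computing VIF with statsmodels; the synthetic predictors are illustrative, and a common rule of thumb treats VIF values above 5-10 as a sign of multicollinearity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors: x2 is deliberately correlated with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # add an intercept column

for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.2f}")
# High VIF for x1 and x2 flags the multicollinearity between them.
```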

Assumption-4: No extreme outliers

In the logistic regression concept, the assumption is that the dataset contains no influential observations or extreme outliers.

Now, the question is - how can you test this assumption?

The typical approach to check for influential observations and extreme outliers in a dataset is to calculate Cook’s distance for every observation. If outliers do exist, you can choose to (i) discard them, (ii) substitute them with a value like the median or mean, or (iii) keep them in the model and make a note of this when reporting the regression results.
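Below is a minimal sketch of option (ii), replacing flagged values with the median. For brevity, a simple z-score rule stands in for a full influence measure such as Cook’s distance; the data and the 3-standard-deviation cutoff are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=100)
x[[5, 42]] = [95.0, -60.0]  # inject two extreme outliers

# Flag observations more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
outliers = np.abs(z) > 3

# Option (ii): substitute the flagged values with the median of the rest.
x_clean = x.copy()
x_clean[outliers] = np.median(x[~outliers])

print(f"{outliers.sum()} outliers replaced with the median {np.median(x[~outliers]):.2f}")
```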

Assumption-5: Linear relationship between the explanatory variables and the logit of the response variable

Logistic regression presumes the existence of a linear relationship between every explanatory variable and the logit of the response variable.

The logit function is defined as below:

Logit(p) = log(p / (1-p)) (here p shows the probability of a positive outcome)

The simplest way to check this assumption is to use a Box-Tidwell test.
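Here is a minimal sketch of the Box-Tidwell idea with statsmodels: add an x*ln(x) term for a positive-valued continuous predictor and check whether it is statistically significant; a significant term suggests the linearity-of-logit assumption is violated. The data is synthetic and illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=300)        # positive-valued continuous predictor
p = 1 / (1 + np.exp(-(0.8 * x - 2.0)))     # data generated linearly in the logit
y = rng.binomial(1, p)

# Box-Tidwell: augment the model with the interaction term x * ln(x).
X = sm.add_constant(np.column_stack([x, x * np.log(x)]))
res = sm.Logit(y, X).fit(disp=0)

# A significant x*ln(x) term would indicate non-linearity in the logit.
print(f"p-value for the x*ln(x) term: {res.pvalues[2]:.3f}")
```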
Assumption-6: The sample size is significantly big

Another assumption in logistic regression is that the sample size of the dataset is big enough to draw accurate conclusions from the fitted logistic regression model.

As a rule of thumb, you should have at least 5 cases with the least frequent outcome for every explanatory variable. Suppose you work with two explanatory variables, and the expected probability of the least frequent outcome is 0.10. In this case, the sample size must be at least (5*2) / 0.10 = 100.
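As a quick sketch, this rule of thumb can be written as a small helper function (the function name is illustrative):

```python
import math

def min_sample_size(n_explanatory, p_least_frequent, cases_per_variable=5):
    """Rule of thumb: N >= (cases_per_variable * n_explanatory) / p_least_frequent."""
    return math.ceil(cases_per_variable * n_explanatory / p_least_frequent)

# Two explanatory variables, least frequent outcome expected 10% of the time:
print(min_sample_size(2, 0.10))  # -> 100
```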

Best Data Science Courses

Programs From Top Universities

upGrad's data science degrees offer an immersive learning experience. These data science certification courses are designed in collaboration with top universities, ensuring an industry-relevant curriculum. Learners in our data science online classes gain insights into big data and ML technologies.
