In this article, we’ll discuss one of the most common yet challenging concepts in machine learning: logistic regression. You’ll learn what logistic regression is and how the logistic regression equation is derived.
We’ve also included an example of logistic regression in R to make the concept easier to follow. Make sure you understand the ideas reasonably well before you work on the example. It helps to be familiar with linear regression, because the two concepts are closely linked.
What is Logistic Regression?
Logistic regression predicts a binary outcome according to a set of independent variables. It is a classification algorithm that predicts the probability of an event’s occurrence using a logit function and fitting data to it. Logistic regression is different from linear regression as it can predict the likelihood of a result that can only have two values. Using linear regression is not suitable when you have a binary variable because:
- The linear regression would predict values outside the required range
- A single straight line may not fit the two outcome values well
Logistic regression doesn’t produce a line as linear regression does. It produces a logistic (S-shaped) curve whose values range between 0 and 1.
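To make the range problem concrete, here is a minimal sketch in R. The intercept, slope, and x values are made-up numbers for illustration, not estimates from any dataset. It compares a straight-line prediction with the same values passed through the logistic function, plogis() from base R's stats package.

```r
# Hypothetical linear predictor: intercept 0.5 and slope 0.3 are made up
x <- c(-10, -2, 0, 2, 10)
linear_pred <- 0.5 + 0.3 * x

# A straight line is unbounded, so some predictions fall outside [0, 1]
print(linear_pred)

# The logistic function maps any real number into the interval (0, 1)
logistic_pred <- plogis(linear_pred)
print(round(logistic_pred, 3))
```

The first vector contains values below 0 and above 1, which cannot be read as probabilities, while every value in the second vector lies strictly between 0 and 1.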
Logistic Regression Equation Derivation
We can derive the logistic regression equation from the linear regression equation. Logistic regression belongs to the class of generalized linear models (GLMs). Nelder and Wedderburn introduced this framework in 1972 as a way of extending linear regression to problems it couldn’t solve before. They proposed a class of models of which logistic regression is a special case.
We know that the equation of a generalized linear model is the following:
$$g(E(y)) = \alpha + \beta x_1$$
g() stands for the link function, E(y) stands for the expectation of the target variable, and the RHS (right-hand side) is the linear predictor. The link function ‘links’ the expectation of y with the linear predictor.
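The idea of a link function can be inspected directly in R. The sketch below assumes only base R: binomial() returns a family object whose linkfun and linkinv components are the link g() and its inverse. The probability 0.8 is an arbitrary example value.

```r
# binomial() returns a family object; its default link is the logit
fam <- binomial()
print(fam$link)

# linkfun is g(): it maps a mean (here a probability) to the linear-predictor scale
eta <- fam$linkfun(0.8)

# linkinv is the inverse of g(): it maps the linear predictor back to a probability
p_back <- fam$linkinv(eta)
print(c(eta = eta, p_back = p_back))
```

Here eta equals log(0.8 / 0.2), and applying the inverse link recovers the original 0.8.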
Suppose we have data of 100 clients, and we need to predict whether a client will buy a specific product or not. As we have a categorical outcome variable, we must use logistic regression.
We’ll start with a linear regression equation:
$$g(y) = \beta_0 + \beta(\text{income}) \tag{1}$$
Here, we’ve kept the independent variable as ‘income’ for ease of understanding.
Our focus is on the probability of the resultant dependent variable (will the customer buy or not?). As we’ve already discussed, g() is our link function, and it is based on the Probability of Success (p) and Probability of Failure (1-p). p should have the following qualities:
- p should always be positive
- p should always be less than or equal to 1
Now, we’ll denote g() with ‘p’ and derive our logistic regression equation.
As probability is always positive, we’ll write the linear equation in its exponential form and get the following result:
$$p = \exp(\beta_0 + \beta(\text{income})) = e^{\beta_0 + \beta(\text{income})} \tag{2}$$
We’ll have to divide p by a number greater than p to make the probability less than 1:
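$$p = \frac{\exp(\beta_0 + \beta(\text{income}))}{\exp(\beta_0 + \beta(\text{income})) + 1} = \frac{e^{\beta_0 + \beta(\text{income})}}{e^{\beta_0 + \beta(\text{income})} + 1} \tag{3}$$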
Using eq. (1), (2), and (3), and writing y as shorthand for the linear predictor on the right-hand side of eq. (1), we can define p as:
$$p = \frac{e^{y}}{1 + e^{y}} \tag{4}$$
Here, p is the probability of success, so 1-p must be the probability of failure:
$$q = 1 - p = 1 - \frac{e^{y}}{1 + e^{y}} \tag{5}$$
Let’s now divide (4) by (5):
$$\frac{p}{1 - p} = e^{y}$$
If we take log on both sides, we get the following:
$$\log\left(\frac{p}{1 - p}\right) = y$$
This is the link function. When we substitute the value of y we had established previously, we get:
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta(\text{income})$$
And there we have it, the logistic regression equation. As p is a probability, its value always remains between 0 and 1.
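The derivation can be checked numerically. The following is a small sketch with hypothetical coefficients and incomes (beta0, beta, and income are invented for illustration). It computes p from eq. (4), q from eq. (5), and confirms that the odds equal exp(y) and the log-odds equal y.

```r
# Hypothetical coefficients and incomes, chosen only for illustration
beta0  <- -4
beta   <- 0.05
income <- c(20, 80, 140)

y <- beta0 + beta * income      # linear predictor
p <- exp(y) / (1 + exp(y))      # eq. (4): probability of success
q <- 1 - p                      # eq. (5): probability of failure

odds     <- p / q               # should equal exp(y)
log_odds <- log(odds)           # should equal y, the link function

print(round(p, 3))
print(all.equal(odds, exp(y)))
print(all.equal(log_odds, y))
```

Whatever values are chosen for the coefficients, p stays strictly between 0 and 1, and taking the log of the odds returns the linear predictor.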
Example of Logistic Regression in R
In our example of logistic regression in R, we’re using data from UCLA (University of California, Los Angeles). Here, we have to create a model that predicts the chances of being admitted based on the data we have. There are four variables: GPA, GRE score, the rank of the student’s undergraduate college, and admit.
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
str(df)
## ‘data.frame’: 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 …
## $ gre : int 380 660 800 640 520 760 560 400 540 700 …
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 …
## $ rank : int 3 3 1 4 4 2 1 2 3 2 …
All the variables are either numeric or integer. A count of missing values (for example, with sum(is.na(df))) returns zero:
##  0
We also find that there are no null values, and that there are more rejects than admits, because the mean of the variable admit is smaller than 0.5.
You should make sure that admits and rejects are reasonably distributed across every category of rank. If one rank had only 5 rejects (or 5 admits), for instance, you might not want to include that rank in your analysis.
xtabs(~ admit + rank, data = df)
##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12
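To see the balance within each rank more directly, the counts can be turned into proportions. The sketch below re-enters the counts from the cross-tabulation above by hand, so it runs without downloading the data. The object names counts, col_share, and overall_admit_rate are illustrative.

```r
# Counts copied from the cross-tabulation above (rows: admit 0/1, columns: rank 1-4)
counts <- matrix(c(28, 97, 93, 55,
                   33, 54, 28, 12),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(admit = c("0", "1"),
                                 rank  = c("1", "2", "3", "4")))

# Share of rejects and admits within each rank (each column sums to 1)
col_share <- prop.table(counts, margin = 2)
print(round(col_share, 3))

# Overall admit rate across all applicants
overall_admit_rate <- sum(counts["1", ]) / sum(counts)
print(overall_admit_rate)
```

Every rank has a reasonable number of both outcomes, and the overall admit rate is below 0.5, which matches the observation that rejects outnumber admits.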
Let’s run our function now:
df$rank <- as.factor(df$rank)
logit <- glm(admit ~ gre + gpa + rank, data = df, family = "binomial")
summary(logit)
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *
## gpa          0.804038   0.331819   2.423 0.015388 *
## rank2       -0.675443   0.316490  -2.134 0.032829 *
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
## Number of Fisher Scoring iterations: 4
Notice that we converted the rank variable from integer to factor before fitting the model, so that R treats rank as a category rather than as a number. Make sure that you do the same.
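Because rank is a factor, the output lists rank2, rank3, and rank4 separately, each compared with rank 1 as the reference level. One common way to read such coefficients is to exponentiate them into odds ratios. The sketch below re-enters the estimates from the summary output by hand (the vector name coefs is illustrative), so it runs without refitting the model.

```r
# Estimates copied from the summary output above
coefs <- c("(Intercept)" = -3.989979, gre = 0.002264, gpa = 0.804038,
           rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs[-1])   # drop the intercept
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of admission rise with that variable, and a ratio below 1 means they fall. Here a one-point increase in GPA multiplies the odds by about 2.2, while attending a rank-2 college instead of a rank-1 college roughly halves them, other variables held fixed.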
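Suppose a student has a GPA of 3.8 and a GRE score of 790, and studied at a rank-1 college. Let’s use our model to find the student’s chances of being admitted:
x <- data.frame(gre=790,gpa=3.8,rank=as.factor(1))
predict(logit, x, type = "response")
With type = "response", predict() returns a probability rather than log-odds. For this student the linear predictor (the log-odds) is about 0.85, which corresponds to a probability of roughly 70% of being admitted.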
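As a hand check on the prediction step, the sketch below plugs the student's values into the fitted equation using the rounded estimates printed in the summary output. It does not call the fitted model, so small rounding differences from predict() are possible. For a glm, predict() returns the log-odds by default (type = "link") and the probability with type = "response"; the calculation below produces both quantities.

```r
# Rounded estimates copied from the summary output; rank 1 is the reference
# level, so it contributes nothing to the linear predictor
coefs <- c(intercept = -3.989979, gre = 0.002264, gpa = 0.804038)

# Linear predictor (log-odds) for GRE 790, GPA 3.8, rank 1
log_odds <- coefs[["intercept"]] + coefs[["gre"]] * 790 + coefs[["gpa"]] * 3.8

# Logistic function converts log-odds to a probability
prob <- 1 / (1 + exp(-log_odds))

print(round(log_odds, 3))
print(round(prob, 3))
```

The log-odds come out near 0.85 and the corresponding probability near 0.70, so the two scales should not be confused when reading the output of predict().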
That’s it for this article. We’re confident that you’d have found it quite helpful. If you have any questions or thoughts on logistic regression and its related topics, please share them in the comment section below.