Loan Approval Classification Using Logistic Regression in R

By Rohit Sharma

Updated on Aug 04, 2025

This project, Loan Approval Classification using Logistic Regression in R, focuses on predicting whether a loan application will be approved, based on applicant information such as age, income, employment experience, and credit score, among other features.

Using a loan dataset, we will preprocess the data, handle missing values, and build a logistic regression model in R using the caret package. 

This project will help you understand how to apply classification techniques for binary outcomes and evaluate model performance using accuracy, confusion matrix, and ROC curve.

Decode the future with upGrad’s Data Science courses. Gain hands-on skills in AI, Machine Learning, and Analytics, built for tomorrow’s data leaders. Enrol and accelerate your success today.

Also Read: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

Essential Concepts to Know Before Starting the Loan Approval Classification Project

Before diving into this project, it's helpful to have a basic understanding of the following:

  • Classification Basics – Know the difference between classification and regression, especially binary classification (e.g., approved vs. not approved).
  • Logistic Regression – Understand how logistic regression models the probability of a class for classification tasks (see the short sketch after this list).
  • Data Types – Be able to distinguish numerical from categorical features and understand why the distinction matters in modeling.
  • Data Cleaning – Know how to handle missing values and outliers and how to prepare data for modeling.
  • Model Evaluation – Know what accuracy, a confusion matrix, and an ROC curve tell you about a model's performance.
  • R Programming Basics – Be comfortable with basic R syntax, loading libraries, and reading datasets.
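To make the logistic regression idea concrete, here is a minimal sketch (not part of the project code) of how the logistic (sigmoid) function turns a linear score into a probability between 0 and 1. The coefficients b0 and b1 below are made-up values for illustration only:

# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients for a single feature (e.g., credit score)
b0 <- -10      # intercept (illustrative value)
b1 <- 0.015    # slope (illustrative value)

credit_score <- c(500, 650, 800)
p_approved <- sigmoid(b0 + b1 * credit_score)
round(p_approved, 3)  # probabilities rise with credit score: 0.076, 0.438, 0.881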

Accelerate your career in AI and Data Science with upGrad’s globally recognized programs. From foundational certifications to advanced degrees, gain cutting-edge skills in Generative AI, Machine Learning, and Data Analysis. Enrol now and lead the change.

Time & Skill Needed for This Loan Approval Classification Project

Completing this project takes a modest time commitment and only basic skills. The table below summarizes both.

Aspect | Details
Estimated Duration | 3 to 5 hours (including setup, coding, and evaluation)
Difficulty Level | Beginner to Intermediate
Skill Level | Beginner (basic R, data handling, and understanding of classification)

What You'll Use to Build This Loan Classification Project

To work on this project, we'll use the tools and libraries listed in the table below:

Tool/Library | Purpose
Google Colab | Cloud-based environment to run R code without local setup
tidyverse (readr, dplyr, ggplot2) | Reading CSV files, data manipulation, and visualization
janitor | Cleaning column names into a consistent snake_case format
caret | Splitting data and training/evaluating machine learning models
skimr | Summarizing and exploring data easily
corrplot | Plotting correlation heatmaps
pROC | Plotting ROC curves and computing AUC
randomForest, xgboost, vip | Installed for optional extensions; the model trained here is logistic regression

Also Read: R For Data Science: Why Should You Choose R for Data Science?

A Step-by-Step Breakdown of the Loan Approval Classification Project With Code

This section walks through building the project step by step in R on Google Colab, with the code and output explained along the way.

Step 1: Configure Google Colab for R Programming

To begin working with R code in Google Colab, the environment must be set to use R instead of Python. This ensures compatibility with R syntax and libraries.

Follow these steps:

  • Launch a new notebook in Google Colab
  • Go to the top menu and click Runtime
  • Choose Change runtime type
  • In the Language dropdown, select R
  • Hit Save to apply the changes

Step 2: Install and Load Required R Packages

In this step, we install all the essential R packages needed for data cleaning, visualization, and machine learning. Once installed, we load them into the session to make them ready for use. Here’s the code:

## ---------- Step 2a: Install packages (run once) ----------
# If a package is already installed, R will just skip it.

packages <- c(
  "tidyverse",   # Data manipulation and visualization
  "caret",       # Machine learning utilities
  "janitor",     # Clean column names
  "skimr",       # Quick data summaries
  "DataExplorer",# Automated EDA tools
  "corrplot",    # Correlation plots
  "randomForest",# Random Forest algorithm
  "xgboost",     # XGBoost algorithm
  "pROC",        # ROC curves and AUC
  "vip"          # Variable importance plots
)

# Identify packages that are not yet installed
new_pkgs <- packages[!(packages %in% installed.packages()[, "Package"])]

# Install missing packages only
if (length(new_pkgs)) install.packages(new_pkgs, dependencies = TRUE)

## ---------- Step 2b: Load packages ----------
# Load all the libraries into the current R session
library(tidyverse)
library(caret)
library(janitor)
library(skimr)
library(DataExplorer)
library(corrplot)
library(randomForest)
library(xgboost)
library(pROC)
library(vip)

This step installs any missing packages and then loads all of them into the session. The output of this code is:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘R.methodsS3’, ‘R.oo’, ‘R.utils’, ‘bitops’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘R.cache’, ‘caTools’, ‘TH.data’, ‘profileModel’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘plotrix’, ‘diagram’, ‘lava’, ‘styler’, ‘labelled’, ‘gplots’, ‘libcoin’, ‘matrixStats’, ‘multcomp’, ‘wk’, ‘permute’, ‘rbibutils’, ‘FNN’, ‘mclust’, ‘multicool’, ‘pracma’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘sparsevctrs’, ‘timeDate’, ‘brglm’, ‘gtools’, ‘lme4’, ‘qvcalc’, ‘rex’, ‘Formula’, ‘plotmo’, ‘prodlim’, ‘combinat’, ‘questionr’, ‘ROCR’, ‘mvtnorm’, ‘modeltools’, ‘strucchange’, ‘coin’, ‘zoo’, ‘sandwich’, ‘ROSE’, ‘plogr’, ‘classInt’, ‘s2’, ‘units’, ‘extrafontdb’, ‘Rttf2pt1’, ‘data.tree’, ‘ca’, ‘colorspace’, ‘gclus’, ‘qap’, ‘registry’, ‘TSP’, ‘vegan’, ‘visNetwork’, ‘Rdpack’, ‘lmtest’, ‘coda’, ‘biglm’, ‘minqa’, ‘statmod’, ‘tweedie’, ‘xmlparsedata’, ‘ks’, ‘crosstalk’, ‘RcppArmadillo’, ‘measures’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘recipes’, ‘reshape2’, ‘BradleyTerry2’, ‘covr’, ‘Cubist’, ‘earth’, ‘ellipse’, ‘fastICA’, ‘gam’, ‘ipred’, ‘kernlab’, ‘klaR’, ‘mda’, ‘mlbench’, ‘MLmetrics’, ‘pamr’, ‘party’, ‘pls’, ‘proxy’, ‘RANN’, ‘spls’, ‘superpc’, ‘themis’, ‘snakecase’, ‘RSQLite’, ‘sf’, ‘tidygraph’, ‘extrafont’, ‘gridExtra’, ‘networkD3’, ‘nycflights13’, ‘seriation’, ‘prettydoc’, ‘DiagrammeR’, ‘Ckmeans.1d.dp’, ‘vcd’, ‘cplm’, ‘lintr’, ‘igraph’, ‘float’, ‘titanic’, ‘microbenchmark’, ‘logcondens’, ‘doParallel’, ‘vdiffr’, ‘yardstick’, ‘bookdown’, ‘DT’, ‘fastshap’, ‘modeldata’, ‘NeuralNetTools’, ‘pdp’, ‘tinytest’, ‘varImp’

 

 

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

 dplyr    1.1.4      readr    2.1.5

 forcats  1.0.0      stringr  1.5.1

 ggplot2  3.5.2      tibble   3.3.0

 lubridate 1.9.4      tidyr    1.3.1

 purrr    1.1.0     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

 dplyr::filter() masks stats::filter()

 dplyr::lag()    masks stats::lag()

Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

 

Attaching package: ‘caret’

 

The following object is masked from ‘package:purrr’:

    lift

 

Attaching package: ‘janitor’

 

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test

  

corrplot 0.95 loaded
 

randomForest 4.7-1.2
 

Type rfNews() to see new features/changes/bug fixes.

 

Attaching package: ‘randomForest’

 

The following object is masked from ‘package:dplyr’:

    combine

 

The following object is masked from ‘package:ggplot2’:

    margin

  

Attaching package: ‘xgboost’ 
 

The following object is masked from ‘package:dplyr’:

    slice

 

 

Type 'citation("pROC")' for a citation.

 

Attaching package: ‘pROC’

 

The following objects are masked from ‘package:stats’:

    cov, smooth, var

 

Attaching package: ‘vip’

 

The following object is masked from ‘package:utils’:

    vi

Here’s an R Project: How to Build an Uber Data Analysis Project in R

Step 3: Upload and Read the Dataset

This step loads the dataset into your R environment. We also keep a raw copy for reference and take a quick look at the structure and dimensions. The code for this section is:


## ---------- Step 3: Read the data ----------
# Set the file path (update this if your filename is different)
data_path <- "loan_data.csv"

# Read the CSV file into R, without converting strings to factors
loan_raw <- read.csv(data_path, stringsAsFactors = FALSE)

# Always keep a backup of the raw data for reference
loan <- loan_raw

# View the first few rows
head(loan)

# Check the number of rows and columns
dim(loan)

The above code loads the dataset into the Colab notebook and gives us a glimpse of the data we will be working with.

 

  person_age person_gender person_education person_income person_emp_exp person_home_ownership loan_amnt loan_intent
1         22        female           Master         71948              0                  RENT     35000    PERSONAL
2         21        female      High School         12282              0                   OWN      1000   EDUCATION
3         25        female      High School         12438              3              MORTGAGE      5500     MEDICAL
4         23        female         Bachelor         79753              0                  RENT     35000     MEDICAL
5         24          male           Master         66135              1                  RENT     35000     MEDICAL
6         21        female      High School         12951              0                   OWN      2500     VENTURE

  loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score previous_loan_defaults_on_file loan_status
1         16.02                0.49                          3          561                             No           1
2         11.14                0.08                          2          504                            Yes           0
3         12.87                0.44                          3          635                             No           1
4         15.23                0.44                          2          675                             No           1
5         14.27                0.53                          4          586                             No           1
6          7.14                0.19                          2          532                             No           1

 

Step 4: Clean Column Names and Convert Target Variable

We now clean the dataset’s column names for easier access and convert the loan_status column into a labeled factor. This prepares the target variable for classification.

## Step 4: Read, clean and convert loan_status

# Set file path and read the CSV file
data_path <- "/content/loan_data.csv"
loan_raw <- read.csv(data_path, stringsAsFactors = FALSE)

# Load janitor and clean the column names (snake_case format)
library(janitor)
loan <- loan_raw %>%
  clean_names()

# Convert loan_status into a factor with labels: 0 = Rejected, 1 = Approved
loan$loan_status <- factor(loan$loan_status, levels = c(0, 1), labels = c("Rejected", "Approved"))

# Check the structure of the cleaned data
str(loan)

# View distribution of the target classes
table(loan$loan_status)

The output for the above code is:

'data.frame': 45000 obs. of  14 variables:

 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...

 $ person_gender                 : chr  "female" "female" "female" "female" ...

 $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...

 $ person_income                 : num  71948 12282 12438 79753 66135 ...

 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...

 $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...

 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...

 $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...

 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...

 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...

 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...

 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...

 $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...

 $ loan_status                   : Factor w/ 2 levels "Rejected","Approved": 2 1 2 2 2 2 2 2 2 2 ...

 

Rejected Approved 

   35000    10000

 

The above output means that:

  • The dataset has 45,000 rows and 14 columns (i.e., 45,000 loan applications with 14 features each).
  • Variables include age, gender, income, loan amount, credit score, and more.
  • The target variable loan_status is a factor with two classes: Approved and Rejected.
  • Class distribution is imbalanced: 35,000 Rejected vs 10,000 Approved applications.

Step 5: Identify Missing Values

Before we handle missing data, we need to check how many missing values exist in each column. This helps us decide how to clean them. The code for this step is:

## Step 5: See total missing values per column
colSums(is.na(loan))  # Shows the number of NA values in each column

The output for this step is:

person_age 0   person_gender 0   person_education 0   person_income 0   person_emp_exp 0   person_home_ownership 0   loan_amnt 0   loan_intent 0   loan_int_rate 0   loan_percent_income 0   cb_person_cred_hist_length 0   credit_score 0   previous_loan_defaults_on_file 0   loan_status 0

Every column shows 0 missing values, so this dataset is already complete; the imputation in Step 12 is kept as a defensive step so the same workflow also works on datasets that do contain gaps.

Step 6: Separate Numeric and Categorical Columns

To prepare for data preprocessing, we first identify which columns are numeric and which are categorical. Here’s the code for this step:

# Get target column
target_col <- "loan_status"

# Separate numeric and categorical columns
num_cols <- loan %>% select(where(is.numeric)) %>% names()    # Numeric features
cat_cols <- loan %>% select(where(is.character)) %>% names()  # Categorical features

# Print column types
cat("Numeric columns:\n", paste(num_cols, collapse = ", "), "\n\n")
cat("Categorical columns:\n", paste(cat_cols, collapse = ", "), "\n")

The output for the above code is:

Numeric columns:

person_age, person_income, person_emp_exp, loan_amnt, loan_int_rate, loan_percent_income, cb_person_cred_hist_length, credit_score 

 

Categorical columns:

person_gender, person_education, person_home_ownership, loan_intent, previous_loan_defaults_on_file 
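Although skimr was installed and loaded in Step 2, the walkthrough never calls it. If you want a richer overview than the base summaries at this point, one optional line gives detailed per-column statistics:

# Optional: detailed per-column summaries (missingness, quantiles, mini histograms)
library(skimr)
skim(loan)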

Project in R: Car Data Analysis Project Using R

Step 7: Convert Categorical Columns to Factors

Most modeling functions in R expect categorical variables to be encoded as factors, so we'll convert the categorical columns using the code below:

## Step 7: Convert character columns to factors
loan[cat_cols] <- lapply(loan[cat_cols], factor)

# Check structure again to confirm the change
str(loan)

The output of this step is:

'data.frame': 45000 obs. of  14 variables:

 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...

 $ person_gender                 : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 1 1 1 ...

 $ person_education              : Factor w/ 5 levels "Associate","Bachelor",..: 5 4 4 2 5 4 2 4 1 4 ...

 $ person_income                 : num  71948 12282 12438 79753 66135 ...

 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...

 $ person_home_ownership         : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 3 1 4 4 3 4 4 4 3 ...

 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...

 $ loan_intent                   : Factor w/ 6 levels "DEBTCONSOLIDATION",..: 5 2 4 4 4 6 2 4 5 6 ...

 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...

 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...

 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...

 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...

 $ previous_loan_defaults_on_file: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...

 $ loan_status                   : Factor w/ 2 levels "Rejected","Approved": 2 1 2 2 2 2 2 2 2 2 ...

The above output means that:

  • The dataset has 45,000 rows and 14 columns.
  • Some columns are numeric (e.g., age, income).
  • Character columns are now converted to factors.
  • Factor columns show how many unique categories they have.

Step 8: Check Class Balance

This step checks how many loans were approved vs. rejected. It also shows what percentage each class represents, helping identify if the data is imbalanced. Here’s the code:

## Step 8: Class balance
table(loan_status = loan$loan_status)  # Count of Approved vs Rejected loans
prop.table(table(loan_status = loan$loan_status)) * 100  # Percentage distribution

The output for this step is given below:

loan_status

Rejected Approved 

   35000    10000

loan_status

Rejected Approved 

77.77778 22.22222
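The split is roughly 78% Rejected to 22% Approved, so the classes are imbalanced. This tutorial trains on the data as-is, but if you want to experiment, caret ships a simple down-sampling helper. A minimal sketch, not used in the rest of this project:

# Optional: balance the classes by down-sampling the majority class.
# downSample() keeps all minority-class rows and randomly drops majority rows.
library(caret)
set.seed(123)
balanced <- downSample(
  x = loan[, setdiff(names(loan), "loan_status")],  # predictors
  y = loan$loan_status,                             # target factor
  yname = "loan_status"                             # name of target in result
)
table(balanced$loan_status)  # both classes now have 10,000 rows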

Step 9: Summary Statistics for Numeric Features

This step gives a quick statistical summary (min, max, mean, median, etc.) for all numeric columns in the dataset. Here’s the code:

## Step 9: Summary stats for numeric features

summary(loan[num_cols])

Here’s the output for this step:

   person_age      person_income      person_emp_exp     loan_amnt    
 Min.   : 20.00   Min.   :   8000    Min.   :  0.00    Min.   :  500  
 1st Qu.: 24.00   1st Qu.:  47204    1st Qu.:  1.00    1st Qu.: 5000  
 Median : 26.00   Median :  67048    Median :  4.00    Median : 8000  
 Mean   : 27.76   Mean   :  80319    Mean   :  5.41    Mean   : 9583  
 3rd Qu.: 30.00   3rd Qu.:  95789    3rd Qu.:  8.00    3rd Qu.:12237  
 Max.   :144.00   Max.   :7200766    Max.   :125.00    Max.   :35000  

 loan_int_rate    loan_percent_income  cb_person_cred_hist_length  credit_score  
 Min.   : 5.42    Min.   :0.0000       Min.   : 2.000              Min.   :390.0  
 1st Qu.: 8.59    1st Qu.:0.0700       1st Qu.: 3.000              1st Qu.:601.0  
 Median :11.01    Median :0.1200       Median : 4.000              Median :640.0  
 Mean   :11.01    Mean   :0.1397       Mean   : 5.867              Mean   :632.6  
 3rd Qu.:12.99    3rd Qu.:0.1900       3rd Qu.: 8.000              3rd Qu.:670.0  
 Max.   :20.00    Max.   :0.6600       Max.   :30.000              Max.   :850.0  

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 10: Visual Exploration of Loan Data

In this step, we visualize the dataset to understand patterns between features and loan approval status. Here’s the code for this step:


library(ggplot2)

# Plot: Loan Status by Loan Intent
if ("loan_intent" %in% names(loan)) {
  ggplot(loan, aes(x = loan_intent, fill = loan_status)) +
    geom_bar(position = "fill") +
    labs(title = "Loan Status by Loan Intent", y = "Proportion") +
    theme_minimal()
}

# Plot: Applicant Income Distribution
if ("person_income" %in% names(loan)) {
  ggplot(loan, aes(x = person_income)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    labs(title = "Applicant Income Distribution") +
    theme_minimal()
}

# Boxplot: Loan Amount by Loan Status
if ("loan_amnt" %in% names(loan)) {
  ggplot(loan, aes(x = loan_status, y = loan_amnt, fill = loan_status)) +
    geom_boxplot() +
    labs(title = "Loan Amount by Loan Status") +
    theme_minimal()
}

The above code generates three graphs, each highlighting a different pattern in the data.

1. Loan Status by Loan Intent (Stacked Bar Chart)

  • Shows the proportion of approved vs. rejected loans across different loan purposes (e.g., personal, education).
  • Helps identify which loan intents have higher approval or rejection rates.

2. Applicant Income Distribution (Histogram)

  • Displays how applicant incomes are spread out across the dataset.
  • Helps detect if the data is skewed or if there are many low/high-income applicants.

3. Loan Amount by Loan Status (Boxplot)

  • Compares the distribution of loan amounts for approved vs. rejected applications.
  • Helps see if higher or lower loan amounts are more likely to be approved.

Improve Your R Skills: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 11: Check Correlation Between Numeric Features

In this step, we’ll generate a heatmap to explore how numeric features relate to one another. This helps identify patterns or highly related variables. Here’s the code:

# Load correlation plot library
library(corrplot)

# Correlation matrix heatmap for numeric features
if (length(num_cols) > 1) {
  corr_mat <- cor(loan[num_cols])
  corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.8)
}

Here’s the graph for the above code:

From the above plot:

  • Key numeric relationships are revealed – for instance, people with higher credit scores tend to have a lower percentage of income going toward loans.
  • Helps in feature selection – we can identify which variables are closely related, so we avoid multicollinearity in modeling and focus on the most useful predictors (see the sketch after this list for a programmatic check).
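If you prefer a programmatic check over eyeballing the heatmap, caret's findCorrelation() flags columns whose pairwise correlation exceeds a cutoff. A minimal sketch; the 0.8 cutoff is an illustrative choice, not from the original walkthrough:

# Flag numeric columns with pairwise correlation above 0.8
library(caret)
high_corr <- findCorrelation(corr_mat, cutoff = 0.8, names = TRUE)
high_corr  # candidate columns to drop before modeling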

Step 12: Handle Missing Values

We clean the dataset by imputing missing values: numeric columns are filled with their median, and categorical columns with their most frequent value (the mode). The code for this step is:



# Helper function to compute the mode
get_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Make a copy of the data so we keep the original safe
loan_clean <- loan

# Impute numeric columns with median
for (col in num_cols) {
  med <- median(loan_clean[[col]], na.rm = TRUE)
  loan_clean[[col]][is.na(loan_clean[[col]])] <- med
}

# Impute categorical columns with mode
for (col in cat_cols) {
  mode_val <- get_mode(loan_clean[[col]])
  loan_clean[[col]][is.na(loan_clean[[col]]) | loan_clean[[col]] == ""] <- mode_val
}

# Confirm no missing values now
colSums(is.na(loan_clean))

The output for the above step is:

person_age 0   person_gender 0   person_education 0   person_income 0   person_emp_exp 0   person_home_ownership 0   loan_amnt 0   loan_intent 0   loan_int_rate 0   loan_percent_income 0   cb_person_cred_hist_length 0   credit_score 0   previous_loan_defaults_on_file 0   loan_status 0

Here’s an Interesting R Project: Movie Rating Analysis Project in R

Step 13: Split the Data into Training and Test Sets

We split the cleaned data into 80% training and 20% test sets using stratified sampling to maintain class balance. Here’s the code:


# Load caret if not already loaded
library(caret)

# Set seed for reproducibility
set.seed(123)

# Create stratified split (80% train, 20% test)
index <- createDataPartition(loan_clean$loan_status, p = 0.8, list = FALSE)

# Create training and testing datasets
train <- loan_clean[index, ]
test  <- loan_clean[-index, ]

# Check dimensions
cat("Train rows:", nrow(train), "\n")
cat("Test rows:", nrow(test), "\n")

The output of the above step is:

Train rows: 36000 

Test rows: 9000 
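To confirm that the stratified split preserved the class mix, you can compare class proportions in the two sets. A quick optional check, not in the original walkthrough:

# Class proportions should be roughly 78% / 22% in both sets
prop.table(table(train$loan_status))
prop.table(table(test$loan_status))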

Step 14: Set Up Model Training Control Parameters

Before training any model, we configure how it should be validated using 5-fold cross-validation repeated 2 times. We also set it up to calculate probabilities for better performance evaluation. Here’s the code:



# Set up training control for 5-fold cross-validation, repeated twice
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  classProbs = TRUE,                 # for probability-based metrics
  summaryFunction = twoClassSummary  # for ROC, Sensitivity, Specificity
)

The above step:

  • Prepares the model for fair, repeated cross-validation.
  • Enables us to evaluate with ROC and other class metrics.

Step 15: Train the Logistic Regression Model

We now train a logistic regression model using all features to predict whether a loan will be approved. Logistic regression is a simple and interpretable baseline model. Here’s the code:

# Make sure the target has the correct reference level (positive class first)
train$loan_status <- relevel(train$loan_status, ref = "Approved")

# Build formula: loan_status ~ all other columns
formula_all <- loan_status ~ .

# Train logistic regression
set.seed(123)
fit_glm <- train(
  formula_all,
  data = train,
  method = "glm",
  family = binomial,
  trControl = ctrl,
  metric = "ROC"
)

# View model summary
fit_glm

Here’s the output:

Generalized Linear Model 

 

36000 samples

   13 predictor

    2 classes: 'Approved', 'Rejected' 

 

No pre-processing

Resampling: Cross-Validated (5 fold, repeated 2 times) 

Summary of sample sizes: 28800, 28800, 28800, 28800, 28800, 28800, ... 

Resampling results:

 

  ROC        Sens       Spec     

  0.9539318  0.7500625  0.9377143

The above output means that:

  • ROC (0.95): Strong discrimination; there is roughly a 95% chance that a randomly chosen approved application receives a higher predicted probability than a randomly chosen rejected one.
  • Sensitivity (0.75): The model correctly identifies about 75% of approved loans.
  • Specificity (0.94): The model correctly identifies about 94% of rejected loans.
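Since logistic regression is prized for its interpretability, you can also inspect the fitted coefficients underlying these results. An optional peek (this output is not shown in the original notebook):

# Inspect the underlying glm fit: coefficients, standard errors, p-values
summary(fit_glm$finalModel)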

Step 16: Make Predictions on the Test Data

Before evaluating, we first generate predictions using the trained model. Here’s the code:

# Predict class labels
pred_class <- predict(fit_glm, newdata = test)

# Predict probabilities (probability of being "Approved")
pred_prob <- predict(fit_glm, newdata = test, type = "prob")[, "Approved"]

Here we make predictions using the trained logistic regression model on the unseen test data. Here's what each part does:

  • pred_class: Predicts whether a loan in the test set will be Approved or Rejected; this is the final classification, using the default 0.5 probability cutoff.
  • pred_prob: Gives the probability that a loan will be approved, which is useful for ROC/AUC evaluation and for applying a custom decision threshold (see the sketch below).
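For instance, if false approvals are costlier than false rejections, you could raise the cutoff above the default 0.5. A minimal sketch; the 0.7 threshold is an illustrative choice, not from the original project:

# Classify as "Approved" only when the predicted probability exceeds 0.7
threshold <- 0.7
pred_strict <- factor(
  ifelse(pred_prob > threshold, "Approved", "Rejected"),
  levels = levels(test$loan_status)
)
table(Predicted = pred_strict, Actual = test$loan_status)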

Step 17: Evaluate Model with Confusion Matrix

We now assess how well the model performs using a confusion matrix, which compares actual vs. predicted outcomes. Here’s the code:

# Confusion Matrix

confusionMatrix(pred_class, test$loan_status, positive = "Approved")

The output for this step is:

Warning message in confusionMatrix.default(pred_class, test$loan_status, positive = "Approved"):

“Levels are not in the same order for reference and data. Refactoring data to match.”

 

Confusion Matrix and Statistics

 

          Reference

Prediction Rejected Approved

  Rejected     6567      488

  Approved      433     1512

                                          

               Accuracy : 0.8977          

                 95% CI : (0.8912, 0.9039)

    No Information Rate : 0.7778          

    P-Value [Acc > NIR] : < 2e-16         

                                          

                  Kappa : 0.701           

                                          

 Mcnemar's Test P-Value : 0.07518         

                                          

            Sensitivity : 0.7560          

            Specificity : 0.9381          

         Pos Pred Value : 0.7774          

         Neg Pred Value : 0.9308          

             Prevalence : 0.2222          

         Detection Rate : 0.1680          

   Detection Prevalence : 0.2161          

      Balanced Accuracy : 0.8471          

                                          

       'Positive' Class : Approved        

                                        

The above output means that:

  • The warning simply notes that the factor levels of the predictions and the reference were in a different order (we releveled train but not test); caret reorders them automatically, so the results are unaffected.
  • The model has an overall accuracy of 89.77%, meaning it correctly predicts most loan outcomes.
  • It is very good at identifying rejected loans, with a specificity of 93.81%.
  • It performs reasonably well at detecting approved loans, with a sensitivity of 75.60%.
  • The positive predictive value (precision for approved loans) is 77.74%, so most predicted approvals are correct.
  • The balanced accuracy is 84.71%, showing the model performs well across both classes, not just the majority one (the sketch below recomputes these figures from the matrix).
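To see where these numbers come from, you can recompute them from the four cells of the confusion matrix. A quick sanity check using the counts above:

# Cells from the confusion matrix above
TP <- 1512  # Approved predicted as Approved
FN <- 488   # Approved predicted as Rejected
TN <- 6567  # Rejected predicted as Rejected
FP <- 433   # Rejected predicted as Approved

TP / (TP + FN)                   # Sensitivity (recall): 1512/2000 = 0.756
TN / (TN + FP)                   # Specificity:          6567/7000 ≈ 0.938
TP / (TP + FP)                   # Pos Pred Value:       1512/1945 ≈ 0.777
(TP + TN) / (TP + TN + FP + FN)  # Accuracy:             8079/9000 ≈ 0.898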

Here’s an R Project: Wine Quality Prediction Project in R

Step 18: Plot ROC Curve and Get AUC Score

This step helps visualize how well the model separates approved and rejected loans. We also compute the AUC score to summarize model performance. This is the code for this step:


library(pROC)

# Create ROC object
roc_obj <- roc(response = test$loan_status, predictor = pred_prob, levels = c("Rejected", "Approved"))

# Plot ROC
plot(roc_obj, col = "blue", main = "ROC Curve - Logistic Regression")

# AUC score
auc(roc_obj)

The graph for the above code is:

[ROC curve plot: sensitivity vs. specificity for the logistic regression model]

The above graph shows:

  • The ROC curve shows the trade-off between sensitivity and specificity.
  • The AUC (Area Under the Curve) summarizes how well the model distinguishes the two classes; the closer to 1, the better (see the optional one-liner below for a confidence interval).
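If you want an uncertainty estimate alongside the point AUC, pROC can compute a confidence interval. An optional one-liner, not in the original notebook:

# 95% confidence interval for the AUC (DeLong method by default)
ci.auc(roc_obj)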

Conclusion

In this Loan Approval Classification project, we built a logistic regression model in R using Google Colab to predict whether a loan would be approved or rejected based on applicant and loan-related features.

After handling missing values, encoding categorical variables, and splitting the data, we trained the model with 5-fold repeated cross-validation. The model was evaluated using a confusion matrix, ROC curve, and AUC score.

It achieved an accuracy of 89.77% and an AUC of around 0.95, performing strongly while maintaining a good balance between sensitivity and specificity.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1kMt6Goyje9PMP9bXMjsMk1khpRiGzxN3#scrollTo=avkmZQTSOc4z

Frequently Asked Questions (FAQs)

1. What is the goal of this Loan Approval Classification project in R?

2. Which tools and R packages are used for this project?

3. Can other algorithms be used instead of logistic regression?

4. What are some other beginner-friendly classification projects in R?

5. How is model performance measured in this project?
