Loan Approval Classification Using Logistic Regression in R

By Rohit Sharma

Updated on Aug 04, 2025

This project, Loan Approval Classification using Logistic Regression in R, focuses on predicting whether a loan application will be approved, based on applicant information such as age, income, employment experience, and credit score, among other features.

Using a loan dataset, we will preprocess the data, handle missing values, and build a logistic regression model in R using the caret package. 

This project will help you understand how to apply classification techniques for binary outcomes and evaluate model performance using accuracy, confusion matrix, and ROC curve.

Decode the future with upGrad’s Data Science courses. Gain hands-on skills in AI, Machine Learning, and Analytics, built for tomorrow’s data leaders. Enrol and accelerate your success today.

Also Read: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

Essential Concepts to Know Before Starting the Loan Approval Classification Project

Before diving into this project, it's helpful to have a basic understanding of the following:

  • Classification Basics – Know the difference between classification and regression, especially binary classification (e.g., approved vs. not approved).
  • Logistic Regression – Understand how logistic regression models the probability of a class for classification tasks (see the short sketch after this list).
  • Data Types – Be able to distinguish numerical from categorical features and understand why the distinction matters in modeling.
  • Data Cleaning – Know how to handle missing values and outliers and how to prepare data for modeling.
  • Model Evaluation – Know what accuracy, a confusion matrix, and an ROC curve tell you about a model's performance.
  • R Programming Basics – Be comfortable with basic R syntax, loading libraries, and reading datasets.
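To make the logistic regression idea concrete, here is a minimal sketch (not part of the project code) of how the logistic (sigmoid) function turns a linear score into a probability between 0 and 1. The coefficients b0 and b1 below are made-up values for illustration only:

# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients for a single feature (e.g., credit score)
b0 <- -10      # intercept (illustrative value)
b1 <- 0.015    # slope (illustrative value)

credit_score <- c(500, 650, 800)
p_approved <- sigmoid(b0 + b1 * credit_score)
round(p_approved, 3)  # probabilities rise with credit score: 0.076, 0.438, 0.881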

Accelerate your career in AI and Data Science with upGrad’s globally recognized programs. From foundational certifications to advanced degrees, gain cutting-edge skills in Generative AI, Machine Learning, and Data Analysis. Enrol now and lead the change.

Time & Skill Needed for This Loan Approval Classification Project

Completing this project takes a modest time commitment and only basic skills. The table below summarizes both.

Aspect | Details
Estimated Duration | 3 to 5 hours (including setup, coding, and evaluation)
Difficulty Level | Beginner to Intermediate
Skill Level | Beginner (basic R, data handling, and understanding of classification)

What You'll Use to Build This Loan Classification Project

To work on this project, we'll use the tools and libraries listed in the table below:

Tool/Library | Purpose
Google Colab | Cloud-based environment to run R code without local setup
tidyverse (readr, dplyr, ggplot2) | Reading CSV files, data manipulation, and visualization
janitor | Cleaning column names into a consistent snake_case format
caret | Splitting data and training/evaluating machine learning models
skimr | Summarizing and exploring data easily
corrplot | Plotting correlation heatmaps
pROC | Plotting ROC curves and computing AUC
randomForest, xgboost, vip | Installed for optional extensions; the model trained here is logistic regression

Also Read: R For Data Science: Why Should You Choose R for Data Science?

A Step-by-Step Breakdown of the Loan Approval Classification Project With Code

This section walks through building the project step by step in R on Google Colab, with the code and output explained along the way.

Step 1: Configure Google Colab for R Programming

To begin working with R code in Google Colab, the environment must be set to use R instead of Python. This ensures compatibility with R syntax and libraries.

Follow these steps:

  • Launch a new notebook in Google Colab
  • Go to the top menu and click Runtime
  • Choose Change runtime type
  • In the Language dropdown, select R
  • Hit Save to apply the changes

Step 2: Install and Load Required R Packages

In this step, we install all the essential R packages needed for data cleaning, visualization, and machine learning. Once installed, we load them into the session to make them ready for use. Here’s the code:

## ---------- Step 2a: Install packages (run once) ----------
# If a package is already installed, R will just skip it.

packages <- c(
  "tidyverse",   # Data manipulation and visualization
  "caret",       # Machine learning utilities
  "janitor",     # Clean column names
  "skimr",       # Quick data summaries
  "DataExplorer",# Automated EDA tools
  "corrplot",    # Correlation plots
  "randomForest",# Random Forest algorithm
  "xgboost",     # XGBoost algorithm
  "pROC",        # ROC curves and AUC
  "vip"          # Variable importance plots
)

# Identify packages that are not yet installed
new_pkgs <- packages[!(packages %in% installed.packages()[, "Package"])]

# Install missing packages only
if (length(new_pkgs)) install.packages(new_pkgs, dependencies = TRUE)

## ---------- Step 2b: Load packages ----------
# Load all the libraries into the current R session
library(tidyverse)
library(caret)
library(janitor)
library(skimr)
library(DataExplorer)
library(corrplot)
library(randomForest)
library(xgboost)
library(pROC)
library(vip)

This step installs any missing packages and then loads all of them into the session. The output of this code is:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘R.methodsS3’, ‘R.oo’, ‘R.utils’, ‘bitops’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘R.cache’, ‘caTools’, ‘TH.data’, ‘profileModel’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘plotrix’, ‘diagram’, ‘lava’, ‘styler’, ‘labelled’, ‘gplots’, ‘libcoin’, ‘matrixStats’, ‘multcomp’, ‘wk’, ‘permute’, ‘rbibutils’, ‘FNN’, ‘mclust’, ‘multicool’, ‘pracma’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘sparsevctrs’, ‘timeDate’, ‘brglm’, ‘gtools’, ‘lme4’, ‘qvcalc’, ‘rex’, ‘Formula’, ‘plotmo’, ‘prodlim’, ‘combinat’, ‘questionr’, ‘ROCR’, ‘mvtnorm’, ‘modeltools’, ‘strucchange’, ‘coin’, ‘zoo’, ‘sandwich’, ‘ROSE’, ‘plogr’, ‘classInt’, ‘s2’, ‘units’, ‘extrafontdb’, ‘Rttf2pt1’, ‘data.tree’, ‘ca’, ‘colorspace’, ‘gclus’, ‘qap’, ‘registry’, ‘TSP’, ‘vegan’, ‘visNetwork’, ‘Rdpack’, ‘lmtest’, ‘coda’, ‘biglm’, ‘minqa’, ‘statmod’, ‘tweedie’, ‘xmlparsedata’, ‘ks’, ‘crosstalk’, ‘RcppArmadillo’, ‘measures’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘recipes’, ‘reshape2’, ‘BradleyTerry2’, ‘covr’, ‘Cubist’, ‘earth’, ‘ellipse’, ‘fastICA’, ‘gam’, ‘ipred’, ‘kernlab’, ‘klaR’, ‘mda’, ‘mlbench’, ‘MLmetrics’, ‘pamr’, ‘party’, ‘pls’, ‘proxy’, ‘RANN’, ‘spls’, ‘superpc’, ‘themis’, ‘snakecase’, ‘RSQLite’, ‘sf’, ‘tidygraph’, ‘extrafont’, ‘gridExtra’, ‘networkD3’, ‘nycflights13’, ‘seriation’, ‘prettydoc’, ‘DiagrammeR’, ‘Ckmeans.1d.dp’, ‘vcd’, ‘cplm’, ‘lintr’, ‘igraph’, ‘float’, ‘titanic’, ‘microbenchmark’, ‘logcondens’, ‘doParallel’, ‘vdiffr’, ‘yardstick’, ‘bookdown’, ‘DT’, ‘fastshap’, ‘modeldata’, ‘NeuralNetTools’, ‘pdp’, ‘tinytest’, ‘varImp’

 

 

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

 dplyr    1.1.4      readr    2.1.5

 forcats  1.0.0      stringr  1.5.1

 ggplot2  3.5.2      tibble   3.3.0

 lubridate 1.9.4      tidyr    1.3.1

 purrr    1.1.0     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

 dplyr::filter() masks stats::filter()

 dplyr::lag()    masks stats::lag()

Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

 

Attaching package: ‘caret’

 

The following object is masked from ‘package:purrr’:

    lift

 

Attaching package: ‘janitor’

 

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test

  

corrplot 0.95 loaded
 

randomForest 4.7-1.2
 

Type rfNews() to see new features/changes/bug fixes.

 

Attaching package: ‘randomForest’

 

The following object is masked from ‘package:dplyr’:

    combine

 

The following object is masked from ‘package:ggplot2’:

    margin

  

Attaching package: ‘xgboost’ 
 

The following object is masked from ‘package:dplyr’:

    slice

 

 

Type 'citation("pROC")' for a citation.

 

Attaching package: ‘pROC’

 

The following objects are masked from ‘package:stats’:

    cov, smooth, var

 

Attaching package: ‘vip’

 

The following object is masked from ‘package:utils’:

    vi

Here’s an R Project: How to Build an Uber Data Analysis Project in R

Step 3: Upload and Read the Dataset

This step loads the dataset into your R environment. We also keep a raw copy for reference and take a quick look at the structure and dimensions. The code for this section is:


## ---------- Step 3: Read the data ----------
# Set the file path (update this if your filename is different)
data_path <- "loan_data.csv"

# Read the CSV file into R, without converting strings to factors
loan_raw <- read.csv(data_path, stringsAsFactors = FALSE)

# Always keep a backup of the raw data for reference
loan <- loan_raw

# View the first few rows
head(loan)

# Check the number of rows and columns
dim(loan)

The above code loads the dataset into the Colab notebook and gives us a glimpse of the data we will be working with.

 

  person_age person_gender person_education person_income person_emp_exp person_home_ownership loan_amnt loan_intent
1         22        female           Master         71948              0                  RENT     35000    PERSONAL
2         21        female      High School         12282              0                   OWN      1000   EDUCATION
3         25        female      High School         12438              3              MORTGAGE      5500     MEDICAL
4         23        female         Bachelor         79753              0                  RENT     35000     MEDICAL
5         24          male           Master         66135              1                  RENT     35000     MEDICAL
6         21        female      High School         12951              0                   OWN      2500     VENTURE

  loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score previous_loan_defaults_on_file loan_status
1         16.02                0.49                          3          561                             No           1
2         11.14                0.08                          2          504                            Yes           0
3         12.87                0.44                          3          635                             No           1
4         15.23                0.44                          2          675                             No           1
5         14.27                0.53                          4          586                             No           1
6          7.14                0.19                          2          532                             No           1

 

Step 4: Clean Column Names and Convert Target Variable

We now clean the dataset’s column names for easier access and convert the loan_status column into a labeled factor. This prepares the target variable for classification.

## Step 4: Read, clean and convert loan_status

# Set file path and read the CSV file
data_path <- "/content/loan_data.csv"
loan_raw <- read.csv(data_path, stringsAsFactors = FALSE)

# Load janitor and clean the column names (snake_case format)
library(janitor)
loan <- loan_raw %>%
  clean_names()

# Convert loan_status into a factor with labels: 0 = Rejected, 1 = Approved
loan$loan_status <- factor(loan$loan_status, levels = c(0, 1), labels = c("Rejected", "Approved"))

# Check the structure of the cleaned data
str(loan)

# View distribution of the target classes
table(loan$loan_status)

The output for the above code is:

'data.frame': 45000 obs. of  14 variables:

 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...

 $ person_gender                 : chr  "female" "female" "female" "female" ...

 $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...

 $ person_income                 : num  71948 12282 12438 79753 66135 ...

 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...

 $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...

 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...

 $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...

 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...

 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...

 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...

 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...

 $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...

 $ loan_status                   : Factor w/ 2 levels "Rejected","Approved": 2 1 2 2 2 2 2 2 2 2 ...

 

Rejected Approved 

   35000    10000

 

The above output means that:

  • The dataset has 45,000 rows and 14 columns (i.e., 45,000 loan applications with 14 features each).
  • Variables include age, gender, income, loan amount, credit score, and more.
  • The target variable loan_status is a factor with two classes: Approved and Rejected.
  • Class distribution is imbalanced: 35,000 Rejected vs 10,000 Approved applications.

Step 5: Identify Missing Values

Before we handle missing data, we need to check how many missing values exist in each column. This helps us decide how to clean them. The code for this step is:

## Step 5: See total missing values per column
colSums(is.na(loan))  # Shows the number of NA values in each column

The output for this step is:

person_age 0   person_gender 0   person_education 0   person_income 0   person_emp_exp 0   person_home_ownership 0   loan_amnt 0   loan_intent 0   loan_int_rate 0   loan_percent_income 0   cb_person_cred_hist_length 0   credit_score 0   previous_loan_defaults_on_file 0   loan_status 0

Every column shows 0 missing values, so this dataset is already complete; the imputation in Step 12 is kept as a defensive step so the same workflow also works on datasets that do contain gaps.

Step 6: Separate Numeric and Categorical Columns

To prepare for data preprocessing, we first identify which columns are numeric and which are categorical. Here’s the code for this step:

# Get target column
target_col <- "loan_status"

# Separate numeric and categorical columns
num_cols <- loan %>% select(where(is.numeric)) %>% names()    # Numeric features
cat_cols <- loan %>% select(where(is.character)) %>% names()  # Categorical features

# Print column types
cat("Numeric columns:\n", paste(num_cols, collapse = ", "), "\n\n")
cat("Categorical columns:\n", paste(cat_cols, collapse = ", "), "\n")

The output for the above code is:

Numeric columns:

person_age, person_income, person_emp_exp, loan_amnt, loan_int_rate, loan_percent_income, cb_person_cred_hist_length, credit_score 

 

Categorical columns:

person_gender, person_education, person_home_ownership, loan_intent, previous_loan_defaults_on_file 
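Although skimr was installed and loaded in Step 2, the walkthrough never calls it. If you want a richer overview than the base summaries at this point, one optional line gives detailed per-column statistics:

# Optional: detailed per-column summaries (missingness, quantiles, mini histograms)
library(skimr)
skim(loan)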

Project in R: Car Data Analysis Project Using R

Step 7: Convert Categorical Columns to Factors

Most modeling functions in R expect categorical variables to be encoded as factors, so we'll convert the categorical columns using the code below:

## Step 7: Convert character columns to factors
loan[cat_cols] <- lapply(loan[cat_cols], factor)

# Check structure again to confirm the change
str(loan)

The output of this step is:

'data.frame': 45000 obs. of  14 variables:

 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...

 $ person_gender                 : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 1 1 1 ...

 $ person_education              : Factor w/ 5 levels "Associate","Bachelor",..: 5 4 4 2 5 4 2 4 1 4 ...

 $ person_income                 : num  71948 12282 12438 79753 66135 ...

 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...

 $ person_home_ownership         : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 3 1 4 4 3 4 4 4 3 ...

 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...

 $ loan_intent                   : Factor w/ 6 levels "DEBTCONSOLIDATION",..: 5 2 4 4 4 6 2 4 5 6 ...

 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...

 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...

 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...

 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...

 $ previous_loan_defaults_on_file: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...

 $ loan_status                   : Factor w/ 2 levels "Rejected","Approved": 2 1 2 2 2 2 2 2 2 2 ...

The above output means that:

  • The dataset has 45,000 rows and 14 columns.
  • Some columns are numeric (e.g., age, income).
  • Character columns are now converted to factors.
  • Factor columns show how many unique categories they have.

Step 8: Check Class Balance

This step checks how many loans were approved vs. rejected. It also shows what percentage each class represents, helping identify if the data is imbalanced. Here’s the code:

## Step 8: Class balance
table(loan_status = loan$loan_status)  # Count of Approved vs Rejected loans
prop.table(table(loan_status = loan$loan_status)) * 100  # Percentage distribution

The output for this step is given below:

loan_status

Rejected Approved 

   35000    10000

loan_status

Rejected Approved 

77.77778 22.22222
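The split is roughly 78% Rejected to 22% Approved, so the classes are imbalanced. This tutorial trains on the data as-is, but if you want to experiment, caret ships a simple down-sampling helper. A minimal sketch, not used in the rest of this project:

# Optional: balance the classes by down-sampling the majority class.
# downSample() keeps all minority-class rows and randomly drops majority rows.
library(caret)
set.seed(123)
balanced <- downSample(
  x = loan[, setdiff(names(loan), "loan_status")],  # predictors
  y = loan$loan_status,                             # target factor
  yname = "loan_status"                             # name of target in result
)
table(balanced$loan_status)  # both classes now have 10,000 rows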

Step 9: Summary Statistics for Numeric Features

This step gives a quick statistical summary (min, max, mean, median, etc.) for all numeric columns in the dataset. Here’s the code:

## Step 9: Summary stats for numeric features

summary(loan[num_cols])

Here’s the output for this step:

   person_age      person_income      person_emp_exp     loan_amnt    
 Min.   : 20.00   Min.   :   8000    Min.   :  0.00    Min.   :  500  
 1st Qu.: 24.00   1st Qu.:  47204    1st Qu.:  1.00    1st Qu.: 5000  
 Median : 26.00   Median :  67048    Median :  4.00    Median : 8000  
 Mean   : 27.76   Mean   :  80319    Mean   :  5.41    Mean   : 9583  
 3rd Qu.: 30.00   3rd Qu.:  95789    3rd Qu.:  8.00    3rd Qu.:12237  
 Max.   :144.00   Max.   :7200766    Max.   :125.00    Max.   :35000  

 loan_int_rate    loan_percent_income  cb_person_cred_hist_length  credit_score  
 Min.   : 5.42    Min.   :0.0000       Min.   : 2.000              Min.   :390.0  
 1st Qu.: 8.59    1st Qu.:0.0700       1st Qu.: 3.000              1st Qu.:601.0  
 Median :11.01    Median :0.1200       Median : 4.000              Median :640.0  
 Mean   :11.01    Mean   :0.1397       Mean   : 5.867              Mean   :632.6  
 3rd Qu.:12.99    3rd Qu.:0.1900       3rd Qu.: 8.000              3rd Qu.:670.0  
 Max.   :20.00    Max.   :0.6600       Max.   :30.000              Max.   :850.0  

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 10: Visual Exploration of Loan Data

In this step, we visualize the dataset to understand patterns between features and loan approval status. Here’s the code for this step:


library(ggplot2)

# Plot: Loan Status by Loan Intent
if ("loan_intent" %in% names(loan)) {
  ggplot(loan, aes(x = loan_intent, fill = loan_status)) +
    geom_bar(position = "fill") +
    labs(title = "Loan Status by Loan Intent", y = "Proportion") +
    theme_minimal()
}

# Plot: Applicant Income Distribution
if ("person_income" %in% names(loan)) {
  ggplot(loan, aes(x = person_income)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    labs(title = "Applicant Income Distribution") +
    theme_minimal()
}

# Boxplot: Loan Amount by Loan Status
if ("loan_amnt" %in% names(loan)) {
  ggplot(loan, aes(x = loan_status, y = loan_amnt, fill = loan_status)) +
    geom_boxplot() +
    labs(title = "Loan Amount by Loan Status") +
    theme_minimal()
}

The above code generates three graphs, each highlighting a different pattern in the data.

1. Loan Status by Loan Intent (Stacked Bar Chart)

  • Shows the proportion of approved vs. rejected loans across different loan purposes (e.g., personal, education).
  • Helps identify which loan intents have higher approval or rejection rates.

2. Applicant Income Distribution (Histogram)

  • Displays how applicant incomes are spread out across the dataset.
  • Helps detect if the data is skewed or if there are many low/high-income applicants.

3. Loan Amount by Loan Status (Boxplot)

  • Compares the distribution of loan amounts for approved vs. rejected applications.
  • Helps see if higher or lower loan amounts are more likely to be approved.

Improve Your R Skills: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 11: Check Correlation Between Numeric Features

In this step, we’ll generate a heatmap to explore how numeric features relate to one another. This helps identify patterns or highly related variables. Here’s the code:

# Load correlation plot library
library(corrplot)

# Correlation matrix heatmap for numeric features
if (length(num_cols) > 1) {
  corr_mat <- cor(loan[num_cols])
  corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.8)
}

Here’s the graph for the above code:

From the above plot:

  • Key numeric relationships are revealed – for instance, people with higher credit scores tend to have a lower percentage of income going toward loans.
  • Helps in feature selection – we can identify which variables are closely related, so we avoid multicollinearity in modeling and focus on the most useful predictors (see the sketch after this list for a programmatic check).
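If you prefer a programmatic check over eyeballing the heatmap, caret's findCorrelation() flags columns whose pairwise correlation exceeds a cutoff. A minimal sketch; the 0.8 cutoff is an illustrative choice, not from the original walkthrough:

# Flag numeric columns with pairwise correlation above 0.8
library(caret)
high_corr <- findCorrelation(corr_mat, cutoff = 0.8, names = TRUE)
high_corr  # candidate columns to drop before modeling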

Step 12: Handle Missing Values

We clean the dataset by imputing missing values: numeric columns are filled with their median, and categorical columns with their most frequent value (the mode). The code for this step is:



# Helper function to compute the mode
get_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Make a copy of the data so we keep the original safe
loan_clean <- loan

# Impute numeric columns with median
for (col in num_cols) {
  med <- median(loan_clean[[col]], na.rm = TRUE)
  loan_clean[[col]][is.na(loan_clean[[col]])] <- med
}

# Impute categorical columns with mode
for (col in cat_cols) {
  mode_val <- get_mode(loan_clean[[col]])
  loan_clean[[col]][is.na(loan_clean[[col]]) | loan_clean[[col]] == ""] <- mode_val
}

# Confirm no missing values now
colSums(is.na(loan_clean))

The output for the above step is:

person_age 0   person_gender 0   person_education 0   person_income 0   person_emp_exp 0   person_home_ownership 0   loan_amnt 0   loan_intent 0   loan_int_rate 0   loan_percent_income 0   cb_person_cred_hist_length 0   credit_score 0   previous_loan_defaults_on_file 0   loan_status 0

Here’s an Interesting R Project: Movie Rating Analysis Project in R

Step 13: Split the Data into Training and Test Sets

We split the cleaned data into 80% training and 20% test sets using stratified sampling to maintain class balance. Here’s the code:


# Load caret if not already loaded
library(caret)

# Set seed for reproducibility
set.seed(123)

# Create stratified split (80% train, 20% test)
index <- createDataPartition(loan_clean$loan_status, p = 0.8, list = FALSE)

# Create training and testing datasets
train <- loan_clean[index, ]
test  <- loan_clean[-index, ]

# Check dimensions
cat("Train rows:", nrow(train), "\n")
cat("Test rows:", nrow(test), "\n")

The output of the above step is:

Train rows: 36000 

Test rows: 9000 
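To confirm that the stratified split preserved the class mix, you can compare class proportions in the two sets. A quick optional check, not in the original walkthrough:

# Class proportions should be roughly 78% / 22% in both sets
prop.table(table(train$loan_status))
prop.table(table(test$loan_status))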

Step 14: Set Up Model Training Control Parameters

Before training any model, we configure how it should be validated using 5-fold cross-validation repeated 2 times. We also set it up to calculate probabilities for better performance evaluation. Here’s the code:



# Set up training control for 5-fold cross-validation, repeated twice
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  classProbs = TRUE,                 # for probability-based metrics
  summaryFunction = twoClassSummary  # for ROC, Sensitivity, Specificity
)

The above step:

  • Prepares the model for fair, repeated cross-validation.
  • Enables us to evaluate with ROC and other class metrics.

Step 15: Train the Logistic Regression Model

We now train a logistic regression model using all features to predict whether a loan will be approved. Logistic regression is a simple and interpretable baseline model. Here’s the code:

# Make sure the target has the correct reference level (positive class first)
train$loan_status <- relevel(train$loan_status, ref = "Approved")

# Build formula: loan_status ~ all other columns
formula_all <- loan_status ~ .

# Train logistic regression
set.seed(123)
fit_glm <- train(
  formula_all,
  data = train,
  method = "glm",
  family = binomial,
  trControl = ctrl,
  metric = "ROC"
)

# View model summary
fit_glm

Here’s the output:

Generalized Linear Model 

 

36000 samples

   13 predictor

    2 classes: 'Approved', 'Rejected' 

 

No pre-processing

Resampling: Cross-Validated (5 fold, repeated 2 times) 

Summary of sample sizes: 28800, 28800, 28800, 28800, 28800, 28800, ... 

Resampling results:

 

  ROC        Sens       Spec     

  0.9539318  0.7500625  0.9377143

The above output means that:

  • ROC (0.95): Strong discrimination; there is roughly a 95% chance that a randomly chosen approved application receives a higher predicted probability than a randomly chosen rejected one.
  • Sensitivity (0.75): The model correctly identifies about 75% of approved loans.
  • Specificity (0.94): The model correctly identifies about 94% of rejected loans.
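Since logistic regression is prized for its interpretability, you can also inspect the fitted coefficients underlying these results. An optional peek (this output is not shown in the original notebook):

# Inspect the underlying glm fit: coefficients, standard errors, p-values
summary(fit_glm$finalModel)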

Step 16: Make Predictions on the Test Data

Before evaluating, we first generate predictions using the trained model. Here’s the code:

# Predict class labels
pred_class <- predict(fit_glm, newdata = test)

# Predict probabilities (probability of being "Approved")
pred_prob <- predict(fit_glm, newdata = test, type = "prob")[, "Approved"]

Here we make predictions using the trained logistic regression model on the unseen test data. Here's what each part does:

  • pred_class: Predicts whether a loan in the test set will be Approved or Rejected; this is the final classification, using the default 0.5 probability cutoff.
  • pred_prob: Gives the probability that a loan will be approved, which is useful for ROC/AUC evaluation and for applying a custom decision threshold (see the sketch below).
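For instance, if false approvals are costlier than false rejections, you could raise the cutoff above the default 0.5. A minimal sketch; the 0.7 threshold is an illustrative choice, not from the original project:

# Classify as "Approved" only when the predicted probability exceeds 0.7
threshold <- 0.7
pred_strict <- factor(
  ifelse(pred_prob > threshold, "Approved", "Rejected"),
  levels = levels(test$loan_status)
)
table(Predicted = pred_strict, Actual = test$loan_status)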

Step 17: Evaluate Model with Confusion Matrix

We now assess how well the model performs using a confusion matrix, which compares actual vs. predicted outcomes. Here’s the code:

# Confusion Matrix

confusionMatrix(pred_class, test$loan_status, positive = "Approved")

The output for this step is:

Warning message in confusionMatrix.default(pred_class, test$loan_status, positive = "Approved"):

“Levels are not in the same order for reference and data. Refactoring data to match.”

 

Confusion Matrix and Statistics

 

          Reference

Prediction Rejected Approved

  Rejected     6567      488

  Approved      433     1512

                                          

               Accuracy : 0.8977          

                 95% CI : (0.8912, 0.9039)

    No Information Rate : 0.7778          

    P-Value [Acc > NIR] : < 2e-16         

                                          

                  Kappa : 0.701           

                                          

 Mcnemar's Test P-Value : 0.07518         

                                          

            Sensitivity : 0.7560          

            Specificity : 0.9381          

         Pos Pred Value : 0.7774          

         Neg Pred Value : 0.9308          

             Prevalence : 0.2222          

         Detection Rate : 0.1680          

   Detection Prevalence : 0.2161          

      Balanced Accuracy : 0.8471          

                                          

       'Positive' Class : Approved        

                                        

The above output means that:

  • The warning simply notes that the factor levels of the predictions and the reference were in a different order (we releveled train but not test); caret reorders them automatically, so the results are unaffected.
  • The model has an overall accuracy of 89.77%, meaning it correctly predicts most loan outcomes.
  • It is very good at identifying rejected loans, with a specificity of 93.81%.
  • It performs reasonably well at detecting approved loans, with a sensitivity of 75.60%.
  • The positive predictive value (precision for approved loans) is 77.74%, so most predicted approvals are correct.
  • The balanced accuracy is 84.71%, showing the model performs well across both classes, not just the majority one (the sketch below recomputes these figures from the matrix).
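To see where these numbers come from, you can recompute them from the four cells of the confusion matrix. A quick sanity check using the counts above:

# Cells from the confusion matrix above
TP <- 1512  # Approved predicted as Approved
FN <- 488   # Approved predicted as Rejected
TN <- 6567  # Rejected predicted as Rejected
FP <- 433   # Rejected predicted as Approved

TP / (TP + FN)                   # Sensitivity (recall): 1512/2000 = 0.756
TN / (TN + FP)                   # Specificity:          6567/7000 ≈ 0.938
TP / (TP + FP)                   # Pos Pred Value:       1512/1945 ≈ 0.777
(TP + TN) / (TP + TN + FP + FN)  # Accuracy:             8079/9000 ≈ 0.898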

Here’s an R Project: Wine Quality Prediction Project in R

Step 18: Plot ROC Curve and Get AUC Score

This step helps visualize how well the model separates approved and rejected loans. We also compute the AUC score to summarize model performance. This is the code for this step:


library(pROC)

# Create ROC object
roc_obj <- roc(response = test$loan_status, predictor = pred_prob, levels = c("Rejected", "Approved"))

# Plot ROC
plot(roc_obj, col = "blue", main = "ROC Curve - Logistic Regression")

# AUC score
auc(roc_obj)

The graph for the above code is:

[ROC curve plot: sensitivity vs. specificity for the logistic regression model]

The above graph shows:

  • The ROC curve shows the trade-off between sensitivity and specificity.
  • The AUC (Area Under the Curve) summarizes how well the model distinguishes the two classes; the closer to 1, the better (see the optional one-liner below for a confidence interval).
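If you want an uncertainty estimate alongside the point AUC, pROC can compute a confidence interval. An optional one-liner, not in the original notebook:

# 95% confidence interval for the AUC (DeLong method by default)
ci.auc(roc_obj)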

Conclusion

In this Loan Approval Classification project, we built a logistic regression model in R using Google Colab to predict whether a loan would be approved or rejected based on applicant and loan-related features.

After handling missing values, encoding categorical variables, and splitting the data, we trained the model with 5-fold repeated cross-validation. The model was evaluated using a confusion matrix, ROC curve, and AUC score.

It achieved an accuracy of 89.77% and an AUC of around 0.95, performing strongly while maintaining a good balance between sensitivity and specificity.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1kMt6Goyje9PMP9bXMjsMk1khpRiGzxN3#scrollTo=avkmZQTSOc4z

Frequently Asked Questions (FAQs)

1. What is the goal of this Loan Approval Classification project in R?

2. Which tools and R packages are used for this project?

3. Can other algorithms be used instead of logistic regression?

4. What are some other beginner-friendly classification projects in R?

5. How is model performance measured in this project?
