
Loan Prediction Project: Build a Model to Predict Loan Approvals with Confidence

By Rohit Sharma

Updated on Jul 22, 2025 | 11 min read | 1.3K+ views


Loan prediction is a problem most financial institutions face in the real world. Given historical data on prior applicants, the objective is to predict whether a new loan application will be approved.

The aim of this project is to build a machine learning model that predicts whether a loan application will be approved. We will begin by exploring and preprocessing the dataset, imputing missing values, and encoding categorical variables, and then conduct Exploratory Data Analysis (EDA). After that, we will train and validate several classification models, such as Logistic Regression, Decision Tree, and Random Forest, to select the best-performing classifier.

Explore more project ideas such as this one in our blog post - Top 25+ Essential Data Science Projects GitHub to Explore in 2025.

What Should You Know Beforehand?

It is better to have at least some background in Python programming, basic statistics, and core machine learning concepts such as classification.

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

  • Python: Core programming language for data processing and ML
  • Pandas: Data loading, exploration, and manipulation
  • NumPy: Numerical operations and array handling
  • Matplotlib: Basic plotting and visualization
  • Seaborn: Enhanced statistical data visualization
  • Scikit-learn: Machine learning models, preprocessing, and evaluation
  • Google Colab: Writing and running Python code in a cloud-based, shareable notebook

Models That Will Be Used

We will use the following four classifiers:

  • Logistic Regression: This is a very common binary classification algorithm, ideal for predicting categorical outcomes like Yes or No. It estimates the probability of the target class using a linear decision boundary.
  • Decision Tree Classifier: This is a flowchart-like algorithm that splits the data based on feature values. It is very intuitive and works well for classification problems involving structured data such as customer income, credit history, or employment status.
  • Random Forest Classifier: This is an ensemble technique that builds multiple decision trees and aggregates their outputs, which reduces the risk of overfitting and helps the model generalize well on financial datasets.
  • Support Vector Machine (SVM): This is a powerful classification model that finds the optimal hyperplane to separate loan applicants into approved and rejected categories.

Time Taken and Difficulty

The project will take approximately 3-4 hours to complete. It is perfect for beginners who want to practice data handling, classification techniques, and model evaluation.

How to Build a Loan Prediction Model

Let’s build the project from scratch. We will proceed by:

  1. Loading the dataset
  2. Exploring and cleaning the data
  3. Building and training multiple classification models
  4. Evaluating model performance

Without any further delay, let’s start!

Step 1: Download the Dataset

To train our loan prediction model, we will use the dataset available on Kaggle. It contains two .csv files - train_u6lujuX_CVtuZ9i.csv and test_Y3wMUE5_7gLdaTN.csv. The training dataset file contains applicant details along with their loan status. Meanwhile, the testing dataset file contains only applicant details. 

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset/data.
  3. On the Loan Prediction Problem Dataset page, in the right pane, under the Data Explorer section, click test_Y3wMUE5_7gLdaTN.csv.
  4. Click the download icon.
  5. Once downloaded, click the train_u6lujuX_CVtuZ9i.csv file.
  6. Click the download icon.

Step 2: Upload & Load the Dataset

Upload the downloaded files to Google Colab. Use the code below to do so:

from google.colab import files
uploaded = files.upload()

Now, load the data using the code below.

import pandas as pd

# Load the training and test datasets
train_df = pd.read_csv("train_u6lujuX_CVtuZ9i.csv")
test_df = pd.read_csv("test_Y3wMUE5_7gLdaTN.csv")

# Show data preview
print("Training Data Sample:")
print(train_df.head())

Output:

Training Data Sample:
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No
1  LP001003   Male     Yes          1      Graduate            No
2  LP001005   Male     Yes          0      Graduate           Yes
3  LP001006   Male     Yes          0  Not Graduate            No
4  LP001008   Male      No          0      Graduate            No

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0
1             4583             1508.0       128.0             360.0
2             3000                0.0        66.0             360.0
3             2583             2358.0       120.0             360.0
4             6000                0.0       141.0             360.0

   Credit_History Property_Area Loan_Status
0             1.0         Urban           Y
1             1.0         Rural           N
2             1.0         Urban           Y
3             1.0         Urban           Y
4             1.0         Urban           Y

Step 3: Understand the Dataset Structure

Before moving ahead, let’s inspect the dataset structure, column names, and missing values. Use the code below to get a high-level overview of the data and pinpoint any issues.

# Check shape of the dataset
print("Dataset shape:", train_df.shape)

# List all column names
print("Column names:", train_df.columns.tolist())

# Check for missing values
print("Missing values:\n", train_df.isnull().sum())

# Summary statistics of numerical columns
train_df.describe()

Output:

Dataset shape: (614, 13)

Column names: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']

Missing values:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

 

       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000

What does this output mean?

The output tells us that:

  • The dataset has 614 rows and 13 columns. 
  • Some key columns like Gender, Married, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have missing values.
  • Credit_History has the most missing entries - 50 rows.
  • ApplicantIncome ranges from ₹150 to ₹81,000.
  • LoanAmount varies widely, with a median of about ₹128K and a maximum of ₹700K (LoanAmount is recorded in thousands).
  • Credit_History has a mean of approximately 0.84, so most applicants have a favorable credit history (1).
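The project plan mentions EDA with Matplotlib and Seaborn, so as an optional check at this point, here is a minimal sketch (assuming train_df is loaded as above) that visualizes how approvals relate to credit history, which the summary above suggests is a strong signal:

import seaborn as sns
import matplotlib.pyplot as plt

# Compare approval counts for applicants with and without a credit history
# (rows with a missing Credit_History are simply dropped from the plot)
sns.countplot(data=train_df, x='Credit_History', hue='Loan_Status')
plt.title('Loan Status by Credit History')
plt.show()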

Step 4: Handling Missing Values

In this step, we will impute the missing data to avoid errors and improve accuracy. This is necessary before building our model. To do so, we will:

  • Replace missing categorical values with the most frequent entry (the mode).
  • Substitute missing numerical values with the median, which is less easily skewed by outliers than the mean.
  • Lastly, confirm that no missing values remain.

Use the code given below to accomplish all this.

# Fill missing categorical values with the mode
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    train_df[column].fillna(train_df[column].mode()[0], inplace=True)

# Fill missing numerical values with the median
train_df['LoanAmount'].fillna(train_df['LoanAmount'].median(), inplace=True)
train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median(), inplace=True)

# Confirm missing values are handled
print("Missing values after cleaning:\n", train_df.isnull().sum())

Output:

Missing values after cleaning:
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

/tmp/ipython-input-7-1566441025.py:3: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
  train_df[column].fillna(train_df[column].mode()[0], inplace=True)

(The same FutureWarning is raised again for the two median fills on lines 6 and 7 of the cell.)
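These FutureWarnings do not affect the results here; they flag that calling an inplace method on a column selection will stop working in pandas 3.0. If you want to silence them, assign the result back instead, as the warning message itself suggests. A drop-in alternative to the cell above:

# Pandas 3.0-safe version: assign the filled column back instead of
# calling fillna(..., inplace=True) on a column selection
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    train_df[column] = train_df[column].fillna(train_df[column].mode()[0])

train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].median())
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())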

Step 5: Encode Categorical Features

Machine learning models work with numbers, not strings. So in this step, we will convert categorical, text-based columns like Gender, Married, Education, etc., into numerical values. We will do this using Label Encoding.

Here is the code to do so:

from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()

# List of categorical columns
categorical_cols = ['Gender', 'Married', 'Dependents', 'Education', 
                    'Self_Employed', 'Property_Area', 'Loan_Status']

# Apply label encoding to each column
for col in categorical_cols:
    train_df[col] = le.fit_transform(train_df[col])
train_df.head()

Output:

    Loan_ID  Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
0  LP001002       1        0           0          0              0             5849                0.0       128.0             360.0             1.0              2            1
1  LP001003       1        1           1          0              0             4583             1508.0       128.0             360.0             1.0              0            0
2  LP001005       1        1           0          0              1             3000                0.0        66.0             360.0             1.0              2            1
3  LP001006       1        1           0          1              0             2583             2358.0       120.0             360.0             1.0              2            1
4  LP001008       1        0           0          0              0             6000                0.0       141.0             360.0             1.0              2            1
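One caveat with the loop above: because it reuses a single LabelEncoder, after the loop `le` only remembers the mapping for the last column, so the same encodings cannot easily be applied to the test file. If you rerun the encoding cell, a common variation (a sketch, not part of the original notebook) keeps one fitted encoder per column:

# Keep one fitted encoder per column so the same mappings can be
# reused on test_df later (or inverted with inverse_transform)
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    train_df[col] = encoders[col].fit_transform(train_df[col])

# Later, apply the identical mapping to the test file, for example:
# test_df['Gender'] = encoders['Gender'].transform(test_df['Gender'])
# (this assumes test_df has been cleaned of missing values first)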

 

 

Step 6: Select Features and Split the Data

Our dataset is now clean and encoded, so we will:

  1. Choose relevant input features, i.e., independent variables.
  2. Define the target variable, which in this case is the Loan_Status.
  3. Split the data into training and testing sets. This is required for model evaluation.

To accomplish all this, use the code given below:

from sklearn.model_selection import train_test_split

# Select input features (X) and target variable (y)
X = train_df.drop('Loan_Status', axis=1)  # All columns except the target
y = train_df['Loan_Status']               # Target column

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Print shape to verify
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Output:

Training set shape: (491, 12)

Test set shape: (123, 12)

What does this output mean?

It means that:

  • The training set has 491 rows (loan applications) and 12 columns (features like Gender, Income, LoanAmount, etc.).
  • The test set has 123 rows and the same 12 columns.
  • The dataset has been split into a training set and a test set using an 80-20 ratio. 
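One caveat here: Loan_ID is a row identifier with no predictive value, and if it is still stored as a string column, scikit-learn will raise an error when fitting the models below. If you hit that, drop it from the features; this is a small tweak to the cell above (note the shapes would then show 11 columns instead of 12):

# Drop the identifier column along with the target; Loan_ID carries
# no signal, and scikit-learn cannot train on string columns
X = train_df.drop(['Loan_Status', 'Loan_ID'], axis=1)
y = train_df['Loan_Status']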

Step 7: Train 4 ML Models for Loan Prediction

We will now train four popular classification models: Logistic Regression, Decision Tree, Random Forest, and SVM. These will help us understand which algorithm performs best for our dataset.

Model 1: Logistic Regression

Use the code below to train this classification model:

from sklearn.linear_model import LogisticRegression

# Train logistic regression model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)

# Predict on test set
log_pred = log_model.predict(X_test)

Output:

/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

What does this output mean?

The output means that the logistic regression model didn’t fully converge within the specified number of iterations (max_iter=1000). This usually happens because of:

  • Unscaled numeric features (for example, ApplicantIncome and LoanAmount)
  • The solver simply needing more iterations

A quick fix is sketched below.
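A common remedy (a sketch, not part of the original notebook; the variable names are illustrative) is to standardize the features before fitting, for example with a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardizing the features usually lets the lbfgs solver converge
# without raising max_iter any further
scaled_log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_model.fit(X_train, y_train)
scaled_log_pred = scaled_log_model.predict(X_test)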

Model 2: Decision Tree Classifier

Use the code below to train this model:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Predict on test data
tree_preds = tree_model.predict(X_test)

# Calculate accuracy
tree_acc = accuracy_score(y_test, tree_preds)

# Show results
print("Decision Tree Classifier Results:")
print("Predictions:", tree_preds)
print("Accuracy:", round(tree_acc * 100, 2), "%\n")

Output:

Decision Tree Classifier Results:

Predictions: [1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1
 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0
 0 1 1 0 0 1 1 0 0 0 1 0]

Accuracy: 69.11 %

What does this output mean?

It means that the decision tree model correctly predicted the loan status for about 69 out of every 100 samples in the test set. A 1 means the loan was approved, whereas a 0 means it was not.

Model 3: Random Forest Classifier

Use the code below to train this model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test set
rf_pred = rf_model.predict(X_test)

# Calculate accuracy
rf_acc = accuracy_score(y_test, rf_pred)

# Print results
print("Random Forest Classifier Results:")
print("Predictions:", rf_pred)
print("Accuracy:", round(rf_acc * 100, 2), "%")

Output:

Random Forest Classifier Results:

Predictions: [1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
 0 1 1 1 1 1 1 1 0 1 1 1]

Accuracy: 75.61 %

What does the output mean?

The output means that the model correctly predicted the loan status for about 76 out of every 100 test cases. This is better than the Decision Tree’s accuracy (69.11%).

Model 4: Support Vector Machine 

Use the code below to train this model:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train SVM model
svm_model = SVC(kernel='linear', probability=True)
svm_model.fit(X_train, y_train)

# Predict on test set
svm_pred = svm_model.predict(X_test)

# Calculate accuracy
svm_acc = accuracy_score(y_test, svm_pred)

# Print results
print("Support Vector Machine (SVM) Results:")
print("Predictions:", svm_pred)
print("Accuracy:", round(svm_acc * 100, 2), "%")

Output:

Support Vector Machine (SVM) Results:

Predictions: [1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1]

Accuracy: 79.67 %

What does the output mean?

The output means that the model correctly predicted the loan status for about 80 out of every 100 test cases, making it the best-performing model of the four.

Step 8: Model Evaluation 

Now, we will evaluate how well each model performed on unseen test data. But first, let’s make sure all four accuracy scores are defined before printing them. Use the code below to do so:

from sklearn.metrics import accuracy_score

# Logistic Regression Accuracy
log_acc = accuracy_score(y_test, log_pred)

# Decision Tree Accuracy
tree_acc = accuracy_score(y_test, tree_preds)

# Random Forest Accuracy
rf_acc = accuracy_score(y_test, rf_pred)

# SVM Accuracy
svm_acc = accuracy_score(y_test, svm_pred)

Now that we have defined those, use the code below to evaluate how well each model performed. 

print("Model Performance Summary:")
print(f"Logistic Regression Accuracy: {log_acc * 100:.2f}%")
print(f"Decision Tree Accuracy:       {tree_acc * 100:.2f}%")
print(f"Random Forest Accuracy:       {rf_acc * 100:.2f}%")
print(f"SVM Accuracy:                 {svm_acc * 100:.2f}%")

Output:

Model Performance Summary:
Logistic Regression Accuracy: 78.86%
Decision Tree Accuracy:       69.11%
Random Forest Accuracy:       75.61%
SVM Accuracy:                 79.67%

What does this output mean?

The Support Vector Machine (SVM) performed best. It correctly predicted loan approval status for nearly 80 out of every 100 cases.
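Accuracy alone can be misleading on a dataset like this one, where approvals outnumber rejections. As an optional check (a sketch, not part of the original notebook), scikit-learn’s classification_report shows per-class precision, recall, and F1 for the best model:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the SVM predictions,
# so we can see how well the minority (rejected) class is handled
print(classification_report(y_test, svm_pred))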

Conclusion

Among the four models tested, the Support Vector Machine achieved an accuracy of 79.67%, which makes it the best performer in this loan prediction task. Logistic Regression came very close, but its convergence warning suggests it could be improved with feature scaling or further hyperparameter tuning. Random Forest landed in the middle, while the Decision Tree trailed well behind the SVM.

These models can be fine-tuned further, using hyperparameter optimization and cross-validation, to get a more reliable picture of real-world performance when predicting loan eligibility.
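As a sketch of the cross-validation idea (assuming the X and y from Step 6), scikit-learn’s cross_val_score averages accuracy over several train/test splits, giving a more stable estimate than a single 80-20 split:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation on the full training data
cv_scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
print(f"SVM cross-validated accuracy: {cv_scores.mean() * 100:.2f}% (+/- {cv_scores.std() * 100:.2f}%)")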


Reference:
https://colab.research.google.com/drive/1O3aZYd-7SKQN5W8H6hH1E6IKFrKhKEUQ


Rohit Sharma

804 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...

