Loan Prediction Project: Build a Model to Predict Loan Approvals with Confidence
By Rohit Sharma
Updated on Jul 22, 2025 | 11 min read | 1.3K+ views
Loan prediction is a problem that most financial institutions face in the real world: given historical data on prior applicants, the objective is to predict whether a new loan application will be approved.
The aim of this project is to build a machine learning model that predicts whether a loan application will be approved. We will begin by exploring and preprocessing the dataset, imputing missing values, and encoding categorical variables, and then conduct Exploratory Data Analysis (EDA). After that, we will train and evaluate four classification models, Logistic Regression, Decision Tree, Random Forest, and SVM, to select the best performer.
Explore more project ideas such as this one in our blog post - Top 25+ Essential Data Science Projects GitHub to Explore in 2025.
Before starting, it is better to have at least some background in Python programming, basic statistics, and core machine learning concepts such as classification.
For this project, the following tools and libraries will be used:

Technology/Library | Purpose
Python | Core programming language for data processing and ML
Pandas | Data loading, exploration, and manipulation
NumPy | Numerical operations and array handling
Matplotlib | Basic plotting and visualization
Seaborn | Enhanced statistical data visualization
Scikit-learn | Machine learning models, preprocessing, and evaluation
Google Colab | Writing and running Python code in a cloud-based, shareable notebook
The project will take approximately 3-4 hours to complete. It is perfect for beginners who want to practice data handling, classification techniques, and model evaluation.
Let's start building the project from scratch, beginning with downloading and loading the dataset.
To train our loan prediction model, we will use the dataset available on Kaggle. It contains two .csv files - train_u6lujuX_CVtuZ9i.csv and test_Y3wMUE5_7gLdaTN.csv. The training dataset file contains applicant details along with their loan status. Meanwhile, the testing dataset file contains only applicant details.
Download the two .csv files from the dataset's Kaggle page, then upload them to Google Colab using the code below:
from google.colab import files

# Open a file picker and upload both .csv files from your machine
uploaded = files.upload()
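Alternatively, if you keep the files in Google Drive, you can mount the drive instead of uploading them each session. A minimal sketch; the folder paths below are placeholders you would adjust to wherever you saved the files:

from google.colab import drive

# Mount Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Hypothetical paths; adjust to your own Drive folder
train_path = "/content/drive/MyDrive/loan-data/train_u6lujuX_CVtuZ9i.csv"
test_path = "/content/drive/MyDrive/loan-data/test_Y3wMUE5_7gLdaTN.csv"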
Now, load the data using the code below.
import pandas as pd
# Load the training and test datasets
train_df = pd.read_csv("train_u6lujuX_CVtuZ9i.csv")
test_df = pd.read_csv("test_Y3wMUE5_7gLdaTN.csv")
# Show data preview
print("Training Data Sample:")
print(train_df.head())
Output:
Training Data Sample:
Loan_ID Gender Married Dependents Education Self_Employed \
0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \
0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0
Credit_History Property_Area Loan_Status
0 1.0 Urban Y
1 1.0 Rural N
2 1.0 Urban Y
3 1.0 Urban Y
4 1.0 Urban Y
Before moving ahead, let's inspect the dataset's structure, column names, and missing values. Use the code below to get a high-level overview of the data and pinpoint any issues.
# Check shape of the dataset
print("Dataset shape:", train_df.shape)
# List all column names
print("Column names:", train_df.columns.tolist())
# Check for missing values
print("Missing values:\n", train_df.isnull().sum())
# Summary statistics of numerical columns
train_df.describe()
Output:
Dataset shape: (614, 13)
Column names: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']
Missing values:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000
What does this output mean?
The output tells us that:
- The training set has 614 rows and 13 columns.
- Several columns contain missing values: Gender (13), Married (3), Dependents (15), Self_Employed (32), LoanAmount (22), Loan_Amount_Term (14), and Credit_History (50).
- ApplicantIncome is heavily right-skewed; the mean is about 5,403 while the maximum is 81,000.
- Most loans carry a 360-month term, and the mean Credit_History of 0.84 indicates that most applicants have a positive credit record.
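Since Credit_History looks like the most intuitive driver of approval, a quick EDA plot is worth a few lines before preprocessing. A minimal sketch using the Seaborn and Matplotlib libraries listed earlier (rows with missing values are simply dropped from the plot):

import matplotlib.pyplot as plt
import seaborn as sns

# Compare loan approval counts across credit history values
sns.countplot(data=train_df, x='Credit_History', hue='Loan_Status')
plt.title('Loan Status by Credit History')
plt.show()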
In this step, we will impute the missing data to avoid errors and improve accuracy. This is necessary before building our model. To do so, we will:
- Fill missing categorical values (Gender, Married, Dependents, Self_Employed, Credit_History) with each column's mode.
- Fill missing numerical values (LoanAmount, Loan_Amount_Term) with each column's median.
Use the code below to accomplish this.
# Fill missing categorical values with the mode of each column
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    train_df[column] = train_df[column].fillna(train_df[column].mode()[0])

# Fill missing numerical values with the median of each column
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].median())
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

# Confirm missing values are handled
print("Missing values after cleaning:\n", train_df.isnull().sum())
Output:
Missing values after cleaning:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Note: the chained inplace pattern train_df[column].fillna(value, inplace=True) raises a pandas FutureWarning, because chained assignment with inplace methods will stop working in pandas 3.0. Assigning the result back to the column, as in the code above, avoids the warning.
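The test file has the same kinds of gaps. If you intend to generate predictions for it later, apply the same imputation there too, using statistics computed from the training data so both files are filled consistently. A minimal sketch, assuming test_df was loaded as shown earlier:

# Fill the test set's gaps using statistics from the training data
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    test_df[column] = test_df[column].fillna(train_df[column].mode()[0])

test_df['LoanAmount'] = test_df['LoanAmount'].fillna(train_df['LoanAmount'].median())
test_df['Loan_Amount_Term'] = test_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())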
Machine learning models work with numbers, not strings. So in this step, we will convert categorical, text-based columns like Gender, Married, and Education into numerical values. We will do this using Label Encoding.
Here is the code to do so:
from sklearn.preprocessing import LabelEncoder
# Create a label encoder object
le = LabelEncoder()
# List of categorical columns
categorical_cols = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Property_Area', 'Loan_Status']
# Apply label encoding to each column
for col in categorical_cols:
train_df[col] = le.fit_transform(train_df[col])
train_df.head()
Output:
    Loan_ID  Gender  Married  Dependents  Education  Self_Employed  \
0  LP001002       1        0           0          0              0
1  LP001003       1        1           1          0              0
2  LP001005       1        1           0          0              1
3  LP001006       1        1           0          1              0
4  LP001008       1        0           0          0              0

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0       128.0             360.0
1             4583             1508.0       128.0             360.0
2             3000                0.0        66.0             360.0
3             2583             2358.0       120.0             360.0
4             6000                0.0       141.0             360.0

   Credit_History  Property_Area  Loan_Status
0             1.0              2            1
1             1.0              0            0
2             1.0              2            1
3             1.0              2            1
4             1.0              2            1
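One caveat: the loop above refits a single encoder on every column, so each fitted mapping is discarded as the loop moves on. If you plan to encode the test set with identical mappings, keeping one encoder per column is the safer pattern. A minimal alternative sketch (run instead of the loop above, not after it); Loan_Status is excluded because the test file has no labels:

from sklearn.preprocessing import LabelEncoder

# One encoder per column, fitted on the raw training values,
# so the exact same mapping can be reused on the test set
encoders = {}
for col in ['Gender', 'Married', 'Dependents', 'Education',
            'Self_Employed', 'Property_Area']:
    encoders[col] = LabelEncoder()
    train_df[col] = encoders[col].fit_transform(train_df[col])
    # Later: test_df[col] = encoders[col].transform(test_df[col])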
Our dataset is clean and encoded. Now we will:
- Separate the input features (X) from the target variable (y), dropping Loan_ID since an identifier has no predictive value (and scikit-learn cannot fit on string columns).
- Split the data into an 80% training set and a 20% test set.
To accomplish this, use the code below:
from sklearn.model_selection import train_test_split
# Select input features (X) and target variable (y);
# drop Loan_ID too, since it is an identifier, not a predictor
X = train_df.drop(['Loan_ID', 'Loan_Status'], axis=1)
y = train_df['Loan_Status']  # Target column
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Print shape to verify
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
Output:
Training set shape: (491, 11)
Test set shape: (123, 11)
What does this output mean?
It means that:
- The data was split 80/20: 491 of the 614 samples go to training and 123 to testing.
- Each sample has 11 input features after dropping Loan_ID and the Loan_Status target.
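Approved loans (Y) outnumber rejected ones in this dataset, so a random split can drift from that ratio by chance. Passing stratify=y, as in the hedged variant below, preserves the approval ratio in both subsets:

# Stratified variant: keeps the Y/N approval ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)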
We will now train four popular classification models: Logistic Regression, Decision Tree, Random Forest, and SVM. These will help us understand which algorithm performs best for our dataset.
First, Logistic Regression. Use the code below to train it:
from sklearn.linear_model import LogisticRegression
# Train logistic regression model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
# Predict on test set
log_pred = log_model.predict(X_test)
Output:
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
What does this output mean?
The output means that the logistic regression model didn't fully converge within the specified number of iterations (max_iter=1000). This typically happens because:
- The features are on very different scales; ApplicantIncome runs into the tens of thousands while Credit_History is just 0 or 1, which slows the lbfgs solver.
- The iteration budget is too small for unscaled data, exactly what the warning's suggestion to scale the data or raise max_iter implies.
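As the warning itself suggests, scaling the features usually resolves this. A minimal sketch using a scikit-learn Pipeline with StandardScaler; the variable names match the split above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize features, then fit logistic regression;
# scaled inputs let lbfgs converge in far fewer iterations
scaled_log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_model.fit(X_train, y_train)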
Next, the Decision Tree. Use the code below to train it:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Train the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
# Predict on test data
tree_preds = tree_model.predict(X_test)
# Calculate accuracy
tree_acc = accuracy_score(y_test, tree_preds)
# Show results
print("Decision Tree Classifier Results:")
print("Predictions:", tree_preds)
print("Accuracy:", round(tree_acc * 100, 2), "%\n")
Output:
Decision Tree Classifier Results:
Predictions: [1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0
0 1 1 0 0 1 1 0 0 0 1 0]
Accuracy: 69.11 %
What does this output mean?
It means that the decision tree model correctly predicted the loan status for about 69 out of every 100 samples in the test set. In the prediction array, 1 means the loan was approved and 0 means it was not.
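A fully grown decision tree tends to memorize the training data, which is one reason its test accuracy is the lowest here. Capping the tree's depth is a simple way to check for overfitting; max_depth=4 below is just an illustrative value, not a tuned one:

# A shallower tree often generalizes better than a fully grown one
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
pruned_tree.fit(X_train, y_train)
print("Pruned tree accuracy:", accuracy_score(y_test, pruned_tree.predict(X_test)))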
Third, the Random Forest. Use the code below to train it:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict on test set
rf_pred = rf_model.predict(X_test)
# Calculate accuracy
rf_acc = accuracy_score(y_test, rf_pred)
# Print results
print("Random Forest Classifier Results:")
print("Predictions:", rf_pred)
print("Accuracy:", round(rf_acc * 100, 2), "%")
Output:
Random Forest Classifier Results:
Predictions: [1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
0 1 1 1 1 1 1 1 0 1 1 1]
Accuracy: 75.61 %
What does the output mean?
The output means that the model correctly predicted the loan status for about 76 out of every 100 test cases. This is better than the Decision Tree's accuracy (69.11%).
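One advantage of a Random Forest is that it reports how much each feature contributed to its decisions. A short, optional check; the column names come from the X defined in the split step:

import pandas as pd

# Rank features by how much the forest relied on them
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))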
Finally, the Support Vector Machine. Use the code below to train it:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Train SVM model
svm_model = SVC(kernel='linear', probability=True)
svm_model.fit(X_train, y_train)
# Predict on test set
svm_pred = svm_model.predict(X_test)
# Calculate accuracy
svm_acc = accuracy_score(y_test, svm_pred)
# Print results
print("Support Vector Machine (SVM) Results:")
print("Predictions:", svm_pred)
print("Accuracy:", round(svm_acc * 100, 2), "%")
Output:
Support Vector Machine (SVM) Results:
Predictions: [1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1]
Accuracy: 79.67 %
What does the output mean?
The output means that the model correctly predicted the loan status for about 80 out of every 100 test cases, making it the best-performing model of the four.
Now, we will evaluate how well each model performed on unseen test data. But before that, let's compute all four accuracy scores so they can be printed together. Use the code below to do so:
from sklearn.metrics import accuracy_score
# Logistic Regression Accuracy
log_acc = accuracy_score(y_test, log_pred)
# Decision Tree Accuracy
tree_acc = accuracy_score(y_test, tree_preds)
# Random Forest Accuracy
rf_acc = accuracy_score(y_test, rf_pred)
# SVM Accuracy
svm_acc = accuracy_score(y_test, svm_pred)
Now that we have defined those, use the code below to evaluate how well each model performed.
print("Model Performance Summary:")
print(f"Logistic Regression Accuracy: {log_acc * 100:.2f}%")
print(f"Decision Tree Accuracy: {tree_acc * 100:.2f}%")
print(f"Random Forest Accuracy: {rf_acc * 100:.2f}%")
print(f"SVM Accuracy: {svm_acc * 100:.2f}%")
Output:
Model Performance Summary:
Logistic Regression Accuracy: 78.86%
Decision Tree Accuracy: 69.11%
Random Forest Accuracy: 75.61%
SVM Accuracy: 79.67%
What does this output mean?
The Support Vector Machine (SVM) performed best. It correctly predicted loan approval status for nearly 80 out of every 100 cases.
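Accuracy alone can be misleading when approvals outnumber rejections, as they do here; per-class precision and recall give a fuller picture. A minimal sketch for the best model:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the raw error breakdown for the SVM
print(confusion_matrix(y_test, svm_pred))
print(classification_report(y_test, svm_pred))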
Among the four models tested, the Support Vector Machine came out on top with 79.67% accuracy on this loan prediction task. Logistic Regression came very close (78.86%), though its convergence warning suggests it could improve with feature scaling or further hyperparameter tuning. Random Forest landed in the middle, while the Decision Tree trailed the rest.
These models can be fine-tuned further with hyperparameter optimization and cross-validation to obtain a more reliable estimate of real-world performance when predicting loan eligibility.
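A single 80/20 split can over- or under-state a model's quality. Cross-validation averages performance over several splits; a minimal sketch for the SVM, reusing the X and y defined in the split step:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation gives a steadier accuracy estimate
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
print("CV accuracy: %.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))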
Reference:
https://colab.research.google.com/drive/1O3aZYd-7SKQN5W8H6hH1E6IKFrKhKEUQ