Loan Prediction Project: Build a Model to Predict Loan Approvals with Confidence
By Rohit Sharma
Updated on Jul 22, 2025 | 11 min read | 1.3K+ views
Loan prediction is a problem that most financial institutions face in the real world: given historical data on prior applicants, the objective is to predict whether a new loan application will be approved.
The aim of this project is to build a machine learning model that predicts whether a loan application will be approved. We will begin by exploring and preprocessing the dataset, imputing missing values, and encoding categorical variables, and then conduct Exploratory Data Analysis (EDA). After that, we will train and evaluate four classification models, Logistic Regression, Decision Tree, Random Forest, and SVM, to select the best performer.
Explore more project ideas such as this one in our blog post - Top 25+ Essential Data Science Projects GitHub to Explore in 2025.
Before starting, it is better to have at least some background in Python programming, basic statistics, and core machine learning concepts such as classification.
For this project, the following tools and libraries will be used:

Technology/Library | Purpose
Python | Core programming language for data processing and ML
Pandas | Data loading, exploration, and manipulation
NumPy | Numerical operations and array handling
Matplotlib | Basic plotting and visualization
Seaborn | Enhanced statistical data visualization
Scikit-learn | Machine learning models, preprocessing, and evaluation
Google Colab | Writing and running Python code in a cloud-based, shareable notebook
The project will take approximately 3-4 hours to complete. It is perfect for beginners who want to practice data handling, classification techniques, and model evaluation.
Let's start building the project from scratch, beginning with downloading and loading the dataset.
To train our loan prediction model, we will use the dataset available on Kaggle. It contains two .csv files - train_u6lujuX_CVtuZ9i.csv and test_Y3wMUE5_7gLdaTN.csv. The training dataset file contains applicant details along with their loan status. Meanwhile, the testing dataset file contains only applicant details.
Download the two .csv files from the dataset's Kaggle page, then upload them to Google Colab using the code below:
from google.colab import files

# Open a file picker and upload both .csv files from your machine
uploaded = files.upload()
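Alternatively, if you keep the files in Google Drive, you can mount the drive instead of uploading them each session. A minimal sketch; the folder paths below are placeholders you would adjust to wherever you saved the files:

from google.colab import drive

# Mount Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Hypothetical paths; adjust to your own Drive folder
train_path = "/content/drive/MyDrive/loan-data/train_u6lujuX_CVtuZ9i.csv"
test_path = "/content/drive/MyDrive/loan-data/test_Y3wMUE5_7gLdaTN.csv"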
Now, load the data using the code below.
import pandas as pd
# Load the training and test datasets
train_df = pd.read_csv("train_u6lujuX_CVtuZ9i.csv")
test_df = pd.read_csv("test_Y3wMUE5_7gLdaTN.csv")
# Show data preview
print("Training Data Sample:")
print(train_df.head())
Output:
Training Data Sample:
Loan_ID Gender Married Dependents Education Self_Employed \
0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \
0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0
Credit_History Property_Area Loan_Status
0 1.0 Urban Y
1 1.0 Rural N
2 1.0 Urban Y
3 1.0 Urban Y
4 1.0 Urban Y
Before moving ahead, let's inspect the dataset's structure, column names, and missing values. Use the code below to get a high-level overview of the data and pinpoint any issues.
# Check shape of the dataset
print("Dataset shape:", train_df.shape)
# List all column names
print("Column names:", train_df.columns.tolist())
# Check for missing values
print("Missing values:\n", train_df.isnull().sum())
# Summary statistics of numerical columns
train_df.describe()
Output:
Dataset shape: (614, 13)
Column names: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']
Missing values:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000
What does this output mean?
The output tells us that:
- The training set has 614 rows and 13 columns.
- Several columns contain missing values: Gender (13), Married (3), Dependents (15), Self_Employed (32), LoanAmount (22), Loan_Amount_Term (14), and Credit_History (50).
- ApplicantIncome is heavily right-skewed; the mean is about 5,403 while the maximum is 81,000.
- Most loans carry a 360-month term, and the mean Credit_History of 0.84 indicates that most applicants have a positive credit record.
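Since Credit_History looks like the most intuitive driver of approval, a quick EDA plot is worth a few lines before preprocessing. A minimal sketch using the Seaborn and Matplotlib libraries listed earlier (rows with missing values are simply dropped from the plot):

import matplotlib.pyplot as plt
import seaborn as sns

# Compare loan approval counts across credit history values
sns.countplot(data=train_df, x='Credit_History', hue='Loan_Status')
plt.title('Loan Status by Credit History')
plt.show()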
In this step, we will impute the missing data to avoid errors and improve accuracy. This is necessary before building our model. To do so, we will:
- Fill missing categorical values (Gender, Married, Dependents, Self_Employed, Credit_History) with each column's mode.
- Fill missing numerical values (LoanAmount, Loan_Amount_Term) with each column's median.
Use the code below to accomplish this.
# Fill missing categorical values with the mode of each column
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    train_df[column] = train_df[column].fillna(train_df[column].mode()[0])

# Fill missing numerical values with the median of each column
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].median())
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

# Confirm missing values are handled
print("Missing values after cleaning:\n", train_df.isnull().sum())
Output:
Missing values after cleaning:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Note: the chained inplace pattern train_df[column].fillna(value, inplace=True) raises a pandas FutureWarning, because chained assignment with inplace methods will stop working in pandas 3.0. Assigning the result back to the column, as in the code above, avoids the warning.
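The test file has the same kinds of gaps. If you intend to generate predictions for it later, apply the same imputation there too, using statistics computed from the training data so both files are filled consistently. A minimal sketch, assuming test_df was loaded as shown earlier:

# Fill the test set's gaps using statistics from the training data
for column in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    test_df[column] = test_df[column].fillna(train_df[column].mode()[0])

test_df['LoanAmount'] = test_df['LoanAmount'].fillna(train_df['LoanAmount'].median())
test_df['Loan_Amount_Term'] = test_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())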
Machine learning models work with numbers, not strings. So in this step, we will convert categorical, text-based columns like Gender, Married, and Education into numerical values. We will do this using Label Encoding.
Here is the code to do so:
from sklearn.preprocessing import LabelEncoder
# Create a label encoder object
le = LabelEncoder()
# List of categorical columns
categorical_cols = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Property_Area', 'Loan_Status']
# Apply label encoding to each column
for col in categorical_cols:
train_df[col] = le.fit_transform(train_df[col])
train_df.head()
Output:
    Loan_ID  Gender  Married  Dependents  Education  Self_Employed  \
0  LP001002       1        0           0          0              0
1  LP001003       1        1           1          0              0
2  LP001005       1        1           0          0              1
3  LP001006       1        1           0          1              0
4  LP001008       1        0           0          0              0

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0       128.0             360.0
1             4583             1508.0       128.0             360.0
2             3000                0.0        66.0             360.0
3             2583             2358.0       120.0             360.0
4             6000                0.0       141.0             360.0

   Credit_History  Property_Area  Loan_Status
0             1.0              2            1
1             1.0              0            0
2             1.0              2            1
3             1.0              2            1
4             1.0              2            1
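One caveat: the loop above refits a single encoder on every column, so each fitted mapping is discarded as the loop moves on. If you plan to encode the test set with identical mappings, keeping one encoder per column is the safer pattern. A minimal alternative sketch (run instead of the loop above, not after it); Loan_Status is excluded because the test file has no labels:

from sklearn.preprocessing import LabelEncoder

# One encoder per column, fitted on the raw training values,
# so the exact same mapping can be reused on the test set
encoders = {}
for col in ['Gender', 'Married', 'Dependents', 'Education',
            'Self_Employed', 'Property_Area']:
    encoders[col] = LabelEncoder()
    train_df[col] = encoders[col].fit_transform(train_df[col])
    # Later: test_df[col] = encoders[col].transform(test_df[col])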
Our dataset is clean and encoded. Now we will:
- Separate the input features (X) from the target variable (y), dropping Loan_ID since an identifier has no predictive value (and scikit-learn cannot fit on string columns).
- Split the data into an 80% training set and a 20% test set.
To accomplish this, use the code below:
from sklearn.model_selection import train_test_split
# Select input features (X) and target variable (y);
# drop Loan_ID too, since it is an identifier, not a predictor
X = train_df.drop(['Loan_ID', 'Loan_Status'], axis=1)
y = train_df['Loan_Status']  # Target column
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Print shape to verify
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
Output:
Training set shape: (491, 11)
Test set shape: (123, 11)
What does this output mean?
It means that:
- The data was split 80/20: 491 of the 614 samples go to training and 123 to testing.
- Each sample has 11 input features after dropping Loan_ID and the Loan_Status target.
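Approved loans (Y) outnumber rejected ones in this dataset, so a random split can drift from that ratio by chance. Passing stratify=y, as in the hedged variant below, preserves the approval ratio in both subsets:

# Stratified variant: keeps the Y/N approval ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)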
We will now train four popular classification models: Logistic Regression, Decision Tree, Random Forest, and SVM. These will help us understand which algorithm performs best for our dataset.
First, Logistic Regression. Use the code below to train it:
from sklearn.linear_model import LogisticRegression
# Train logistic regression model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
# Predict on test set
log_pred = log_model.predict(X_test)
Output:
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
What does this output mean?
The output means that the logistic regression model didn't fully converge within the specified number of iterations (max_iter=1000). This typically happens because:
- The features are on very different scales; ApplicantIncome runs into the tens of thousands while Credit_History is just 0 or 1, which slows the lbfgs solver.
- The iteration budget is too small for unscaled data, exactly what the warning's suggestion to scale the data or raise max_iter implies.
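As the warning itself suggests, scaling the features usually resolves this. A minimal sketch using a scikit-learn Pipeline with StandardScaler; the variable names match the split above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize features, then fit logistic regression;
# scaled inputs let lbfgs converge in far fewer iterations
scaled_log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_model.fit(X_train, y_train)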
Next, the Decision Tree. Use the code below to train it:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Train the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
# Predict on test data
tree_preds = tree_model.predict(X_test)
# Calculate accuracy
tree_acc = accuracy_score(y_test, tree_preds)
# Show results
print("Decision Tree Classifier Results:")
print("Predictions:", tree_preds)
print("Accuracy:", round(tree_acc * 100, 2), "%\n")
Output:
Decision Tree Classifier Results:
Predictions: [1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0
0 1 1 0 0 1 1 0 0 0 1 0]
Accuracy: 69.11 %
What does this output mean?
It means that the decision tree model correctly predicted the loan status for about 69 out of every 100 samples in the test set. In the prediction array, 1 means the loan was approved and 0 means it was not.
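A fully grown decision tree tends to memorize the training data, which is one reason its test accuracy is the lowest here. Capping the tree's depth is a simple way to check for overfitting; max_depth=4 below is just an illustrative value, not a tuned one:

# A shallower tree often generalizes better than a fully grown one
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
pruned_tree.fit(X_train, y_train)
print("Pruned tree accuracy:", accuracy_score(y_test, pruned_tree.predict(X_test)))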
Third, the Random Forest. Use the code below to train it:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict on test set
rf_pred = rf_model.predict(X_test)
# Calculate accuracy
rf_acc = accuracy_score(y_test, rf_pred)
# Print results
print("Random Forest Classifier Results:")
print("Predictions:", rf_pred)
print("Accuracy:", round(rf_acc * 100, 2), "%")
Output:
Random Forest Classifier Results:
Predictions: [1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
0 1 1 1 1 1 1 1 0 1 1 1]
Accuracy: 75.61 %
What does the output mean?
The output means that the model correctly predicted the loan status for about 76 out of every 100 test cases. This is better than the Decision Tree's accuracy (69.11%).
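One advantage of a Random Forest is that it reports how much each feature contributed to its decisions. A short, optional check; the column names come from the X defined in the split step:

import pandas as pd

# Rank features by how much the forest relied on them
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))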
Finally, the Support Vector Machine. Use the code below to train it:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Train SVM model
svm_model = SVC(kernel='linear', probability=True)
svm_model.fit(X_train, y_train)
# Predict on test set
svm_pred = svm_model.predict(X_test)
# Calculate accuracy
svm_acc = accuracy_score(y_test, svm_pred)
# Print results
print("Support Vector Machine (SVM) Results:")
print("Predictions:", svm_pred)
print("Accuracy:", round(svm_acc * 100, 2), "%")
Output:
Support Vector Machine (SVM) Results:
Predictions: [1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1]
Accuracy: 79.67 %
What does the output mean?
The output means that the model correctly predicted the loan status for about 80 out of every 100 test cases, making it the best-performing model of the four.
Now, we will evaluate how well each model performed on unseen test data. But before that, let's compute all four accuracy scores so they can be printed together. Use the code below to do so:
from sklearn.metrics import accuracy_score
# Logistic Regression Accuracy
log_acc = accuracy_score(y_test, log_pred)
# Decision Tree Accuracy
tree_acc = accuracy_score(y_test, tree_preds)
# Random Forest Accuracy
rf_acc = accuracy_score(y_test, rf_pred)
# SVM Accuracy
svm_acc = accuracy_score(y_test, svm_pred)
Now that we have defined those, use the code below to evaluate how well each model performed.
print("Model Performance Summary:")
print(f"Logistic Regression Accuracy: {log_acc * 100:.2f}%")
print(f"Decision Tree Accuracy: {tree_acc * 100:.2f}%")
print(f"Random Forest Accuracy: {rf_acc * 100:.2f}%")
print(f"SVM Accuracy: {svm_acc * 100:.2f}%")
Output:
Model Performance Summary:
Logistic Regression Accuracy: 78.86%
Decision Tree Accuracy: 69.11%
Random Forest Accuracy: 75.61%
SVM Accuracy: 79.67%
What does this output mean?
The Support Vector Machine (SVM) performed best. It correctly predicted loan approval status for nearly 80 out of every 100 cases.
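Accuracy alone can be misleading when approvals outnumber rejections, as they do here; per-class precision and recall give a fuller picture. A minimal sketch for the best model:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the raw error breakdown for the SVM
print(confusion_matrix(y_test, svm_pred))
print(classification_report(y_test, svm_pred))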
Among the four models tested, the Support Vector Machine came out on top with 79.67% accuracy on this loan prediction task. Logistic Regression came very close (78.86%), though its convergence warning suggests it could improve with feature scaling or further hyperparameter tuning. Random Forest landed in the middle, while the Decision Tree trailed the rest.
These models can be fine-tuned further with hyperparameter optimization and cross-validation to obtain a more reliable estimate of real-world performance when predicting loan eligibility.
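A single 80/20 split can over- or under-state a model's quality. Cross-validation averages performance over several splits; a minimal sketch for the SVM, reusing the X and y defined in the split step:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation gives a steadier accuracy estimate
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
print("CV accuracy: %.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))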
Reference:
https://colab.research.google.com/drive/1O3aZYd-7SKQN5W8H6hH1E6IKFrKhKEUQ