Loan Default Risk Analysis Using Machine Learning Techniques

By Rohit Sharma

Updated on Jul 25, 2025

Loan default prediction helps lenders decide whether a borrower is likely to repay a loan. In this project, you’ll work with real financial data from Kaggle to build models that predict default risk.

You’ll use Python and key libraries like Pandas, Scikit-learn, and XGBoost. The goal is to train classification models using techniques such as logistic regression, decision trees, and gradient boosting.

By doing this, you’ll explore credit scoring methods, identify important financial indicators, and improve your ability to make data-driven decisions in finance. 

Don't just learn data science, launch your career. Our Online Data Science Courses at upGrad provide the fastest path to mastering job-ready skills. Go from theory to practice with Python, Machine Learning, AI, and Tableau, all taught by industry experts. Your high-growth career starts here. Explore courses now!

Real Projects. Real Skills. Explore our curated library of Python data science projects and take your skills from theory to practice.

Before You Begin: Key Skills to Have

It’s helpful to have some basic knowledge of the following before starting this Loan Default Risk Analysis project:

  • Python programming (variables, functions, loops, basic syntax)
  • Pandas and NumPy (for handling and analyzing data)
  • Matplotlib or Seaborn (for creating charts and visualizing trends)
  • Scikit-learn (for building and evaluating classification and regression models)
  • Financial indicators (understanding credit history, loan amount, interest rates, and employment length)
  • Handling real-world data (dealing with missing values, outliers, and encoding categorical features); a quick warm-up sketch follows this list
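If you want a two-minute warm-up on the data-handling items above, here is a minimal sketch on toy data (not the Kaggle file) showing median imputation and categorical encoding with Pandas:

import pandas as pd

# Toy applicant data with a missing income and a categorical column
toy = pd.DataFrame({
    'income': [50000, None, 72000],
    'employment_length': ['2 years', '10+ years', '2 years'],
})

# Fill the missing numeric value with the column median
toy['income'] = toy['income'].fillna(toy['income'].median())

# Encode the categorical column as integer codes
toy['employment_length_encoded'] = toy['employment_length'].astype('category').cat.codes

print(toy)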

Don't just learn data science. Get mentored by industry leaders. upGrad’s top-ranked courses give you direct access to experienced professionals who will guide your career journey. Learn from the best, become the best.

How This Project Was Built: Tools and Libraries Used

To build the Loan Default Risk Analysis project, we used these tools to process data, build models, and draw insights:

Tool / Library | Purpose
Python | Used for writing scripts, cleaning data, and building models
Pandas | Loading loan data, cleaning columns, handling missing values, and feature selection
NumPy | Performing fast numerical operations and managing arrays
Matplotlib | Creating simple plots to explore default trends and data distribution
Seaborn | Building detailed visualizations like heatmaps, count plots, and box plots
Scikit-learn | Training and evaluating classification models such as logistic regression and decision trees
XGBoost | Building more accurate models using gradient boosting
Jupyter/Colab | Running code in an interactive environment to test ideas quickly
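All of these come preinstalled in Google Colab; if you are working locally, a quick setup sketch:

# Run this in a Colab/Jupyter cell (or drop the "!" in a terminal) if any library is missing
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost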

Models That Drive the Predictions

To predict loan defaults, we combined standard machine learning techniques with financial domain knowledge. Here's how:

  • Classification Models – Use algorithms like logistic regression, decision trees, and XGBoost to predict whether a borrower will default.
  • Feature Engineering – Create new features such as debt-to-income ratio, credit utilization, or employment length to improve model accuracy.
  • Correlation Analysis – Find which factors, like interest rates or loan amount, relate most strongly to defaults (a short sketch follows this list).
  • Data Visualization – Build charts that show trends across credit grades, income levels, and loan purposes.
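As a tiny preview of the feature engineering and correlation analysis ideas above, here is a sketch on made-up numbers (the real project applies the same ideas to the Kaggle data in Steps 3 and 4):

import pandas as pd

# Made-up borrower records: income, loan amount, and whether they defaulted
sample = pd.DataFrame({
    'income':      [40000, 85000, 30000, 120000, 55000],
    'loan_amount': [20000, 15000, 25000,  30000, 40000],
    'default':     [1, 0, 1, 0, 1],
})

# Feature engineering: a simple loan-to-income ratio
sample['loan_income_ratio'] = sample['loan_amount'] / sample['income']

# Correlation analysis: how strongly does the ratio relate to default?
print(sample['loan_income_ratio'].corr(sample['default']))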

How Long It Takes and What to Expect

  • Time required: Around 2 to 3 hours
  • Difficulty level: Moderate 

This project works well if you already know basic Python and want to apply machine learning to real financial data. You'll work with real borrower records and learn how to train models that predict who is likely to default.

How to Build a Loan Default Risk Prediction Project

Let’s start building the project from scratch. We'll go step-by-step through the process of:

  1. Download the Dataset
  2. Load and Explore the Loan Dataset
  3. Clean the Data and Engineer Useful Features
  4. Visualize Key Patterns (EDA)
  5. Train Classification Models
  6. Evaluate Model Performance
  7. Build a Credit Scoring System
  8. Generate the Risk Analysis Report

Without any further delay, let’s get started!

Step 1: Download the Dataset

Download real-world loan data from Kaggle: search for "Loan Default Risk Analysis," download the ZIP file, extract it, and use the CSV file (application_data.csv) for analysis.
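If you prefer the command line, here is a sketch using the Kaggle API instead of a manual download. This assumes you have a Kaggle API token saved at ~/.kaggle/kaggle.json, and the dataset slug below is a placeholder you must replace with the one shown on the dataset's Kaggle page:

# Run in a Colab cell. <owner>/<dataset-slug> is a placeholder; copy the real slug from Kaggle.
!pip install -q kaggle
!kaggle datasets download -d <owner>/<dataset-slug>
!unzip -o <dataset-slug>.zip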

Now that you’ve downloaded the dataset, let’s move on to the next step: uploading and loading it into Google Colab.

Step 2: Load and Understand the Dataset in Google Colab

Now that you have downloaded the dataset, upload the CSV file to Google Colab using the code below:


from google.colab import files
uploaded = files.upload()
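If the file is large, an alternative to files.upload() is mounting Google Drive and reading the CSV from there (a sketch; the path below is a placeholder you would adjust to your own Drive folder):

from google.colab import drive
drive.mount('/content/drive')

# Placeholder path: change it to wherever you saved the extracted CSV in your Drive
# df = pd.read_csv('/content/drive/MyDrive/loan_default/application_data.csv')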

Once uploaded, import the required libraries and use the following Python code to read and check the data:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignore warning messages for cleaner output
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('application_data.csv')  # Replace with the correct path if needed

# Show dataset dimensions
print("\n1. DATASET OVERVIEW:")
print(f"Dataset Shape: {df.shape}")  # Rows and columns
print(f"Number of Features: {df.shape[1]}")
print(f"Number of Loan Applications: {df.shape[0]:,}")

# Check distribution of the target variable
print("\nTarget Distribution:")
target_dist = df['TARGET'].value_counts()
print(f"No Default (0): {target_dist[0]:,} ({target_dist[0]/len(df)*100:.1f}%)")
print(f"Default (1): {target_dist[1]:,} ({target_dist[1]/len(df)*100:.1f}%)")

# Look at missing values in the dataset
print("\nMissing Values Analysis:")
missing_data = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing_data, 'Percentage': missing_percent})
print(missing_df[missing_df['Missing Count'] > 0].head(10))  # Top 10 columns with missing values

# Show summary stats for key financial features
print("\nKey Financial Metrics:")
financial_cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE']
print(df[financial_cols].describe())


Output:

Dataset Overview

Metric | Value
Dataset Shape | (65,980, 122)
Number of Features | 122
Number of Loan Applications | 65,980

Target Variable Distribution

Loan Status | Count | Percentage
No Default (0) | 60,671 | 92.0%
Default (1) | 5,309 | 8.0%

Top 10 Columns with Missing Values

Column Name | Missing Count | Percentage
COMMONAREA_AVG | 46,046 | 69.79%
COMMONAREA_MODE | 46,046 | 69.79%
COMMONAREA_MEDI | 46,046 | 69.79%
NONLIVINGAPARTMENTS_MEDI | 45,743 | 69.33%
NONLIVINGAPARTMENTS_MODE | 45,743 | 69.33%
NONLIVINGAPARTMENTS_AVG | 45,743 | 69.33%
LIVINGAPARTMENTS_AVG | 45,079 | 68.32%
LIVINGAPARTMENTS_MODE | 45,079 | 68.32%
LIVINGAPARTMENTS_MEDI | 45,079 | 68.32%
FONDKAPREMONT_MODE | 45,016 | 68.23%

Key Financial Metrics

Metric | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE
Count | 65,980 | 65,980 | 65,975 | 65,927
Mean | 169,742 | 599,361 | 27,077 | 538,652
Std Dev | 465,229 | 402,713 | 14,493 | 369,981
Min | 25,650 | 45,000 | 2,052 | 45,000
25th Percentile (Q1) | 112,500 | 270,000 | 16,457 | 238,500
Median (Q2) | 144,000 | 513,531 | 24,903 | 450,000
75th Percentile (Q3) | 202,500 | 808,650 | 34,587 | 679,500
Max | 117,000,000 | 4,050,000 | 258,025.5 | 4,050,000

Conclusion: Most applicants have not defaulted, several features have a high share of missing values, and the financial metrics show wide variation in income and credit amounts.

Step 3: Clean the Data and Create New Features

In this step, you:

  • Fill missing values in both numeric and categorical columns.
  • Create new features like credit-to-income ratio and age.
  • Simplify categorical variables for modeling.
  • Combine external source scores to improve prediction.
  • Group applicants by income and credit amount levels.

These actions make the data cleaner and more useful for model training.

Here is the code:

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

def preprocess_and_engineer_features(df):
    """
    Preprocessing and feature engineering for loan default prediction
    """
    df_processed = df.copy()

    # === 1. HANDLE MISSING VALUES ===
    print("Handling missing values...")

    # Fill missing numeric values with median
    numerical_cols = df_processed.select_dtypes(include=[np.number]).columns
    numerical_cols = [col for col in numerical_cols if col != 'TARGET']
    for col in numerical_cols:
        if df_processed[col].isnull().sum() > 0:
            df_processed[col].fillna(df_processed[col].median(), inplace=True)

    # Fill missing categorical values with mode
    categorical_cols = df_processed.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        if df_processed[col].isnull().sum() > 0:
            df_processed[col].fillna(df_processed[col].mode()[0], inplace=True)

    # === 2. FINANCIAL FEATURE ENGINEERING ===
    print("Creating financial risk indicators...")

    # New ratios and per-person calculations
    df_processed['CREDIT_INCOME_RATIO'] = df_processed['AMT_CREDIT'] / df_processed['AMT_INCOME_TOTAL']
    df_processed['ANNUITY_INCOME_RATIO'] = df_processed['AMT_ANNUITY'] / df_processed['AMT_INCOME_TOTAL']
    df_processed['CREDIT_GOODS_RATIO'] = df_processed['AMT_CREDIT'] / df_processed['AMT_GOODS_PRICE']
    df_processed['INCOME_PER_PERSON'] = df_processed['AMT_INCOME_TOTAL'] / df_processed['CNT_FAM_MEMBERS']

    # Convert days to years
    df_processed['AGE_YEARS'] = df_processed['DAYS_BIRTH'] / -365
    df_processed['EMPLOYMENT_YEARS'] = df_processed['DAYS_EMPLOYED'] / -365
    df_processed['EMPLOYMENT_YEARS'] = df_processed['EMPLOYMENT_YEARS'].apply(lambda x: x if x >= 0 else 0)

    # Convert registration and ID publish dates to positive
    df_processed['DAYS_ID_PUBLISH'] = df_processed['DAYS_ID_PUBLISH'] * -1
    df_processed['DAYS_REGISTRATION'] = df_processed['DAYS_REGISTRATION'] * -1

    # === 3. CATEGORICAL FEATURE ENGINEERING ===
    print("Processing categorical features...")

    # Convert yes/no to 1/0
    binary_features = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
    for feature in binary_features:
        df_processed[feature] = df_processed[feature].map({'Y': 1, 'N': 0})

    # Count total contact points
    contact_features = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
                        'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL']
    df_processed['CONTACT_INFO_SCORE'] = df_processed[contact_features].sum(axis=1)

    # Count submitted documents
    document_cols = [col for col in df_processed.columns if col.startswith('FLAG_DOCUMENT_')]
    df_processed['DOCUMENTS_SUBMITTED'] = df_processed[document_cols].sum(axis=1)

    # === 4. EXTERNAL SOURCES FEATURE ===
    ext_sources = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
    df_processed['EXT_SOURCE_MEAN'] = df_processed[ext_sources].mean(axis=1)
    df_processed['EXT_SOURCE_STD'] = df_processed[ext_sources].std(axis=1)
    df_processed['EXT_SOURCE_MAX'] = df_processed[ext_sources].max(axis=1)
    df_processed['EXT_SOURCE_MIN'] = df_processed[ext_sources].min(axis=1)

    # === 5. RISK CATEGORIZATION ===
    # Create income and credit amount buckets
    df_processed['INCOME_CATEGORY'] = pd.cut(df_processed['AMT_INCOME_TOTAL'],
                                             bins=[0, 100000, 200000, 300000, float('inf')],
                                             labels=['Low', 'Medium', 'High', 'Very High'])
    df_processed['CREDIT_CATEGORY'] = pd.cut(df_processed['AMT_CREDIT'],
                                             bins=[0, 300000, 600000, 1000000, float('inf')],
                                             labels=['Small', 'Medium', 'Large', 'Very Large'])

    print("Feature engineering completed!")
    return df_processed

# Apply preprocessing and feature engineering
df_processed = preprocess_and_engineer_features(df)
print(f"Processed dataset shape: {df_processed.shape}")

Output: 

Feature engineering completed! 

Processed dataset shape: (65980, 136)

The cleaned dataset now has 136 features across 65,980 loan applications, ready for analysis.
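Before moving on, it is worth a quick sanity check that the engineered columns look reasonable; for example:

# Quick sanity check on some of the engineered features
print(df_processed[['CREDIT_INCOME_RATIO', 'ANNUITY_INCOME_RATIO',
                    'AGE_YEARS', 'EMPLOYMENT_YEARS', 'EXT_SOURCE_MEAN']].describe())

# The new category buckets should be spread across bins, not collapsed into one
print(df_processed['INCOME_CATEGORY'].value_counts())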

Step 4: Exploratory Data Analysis (EDA)

In this step, we explore key patterns and risk signals in the data.

  • Visualize the default distribution to see class imbalance.
  • Study financial ratios (e.g., credit-to-income) across default outcomes. 
  • Compare borrower profiles (age, employment years) by default status.
  • Analyze education and family status to find segments with higher risk.
  • Plot credit vs income to spot outliers and risk bands.

Here is the code:

import matplotlib.pyplot as plt
import seaborn as sns

def comprehensive_eda(df):
    """
    Comprehensive EDA focusing on default risk factors
    """
    plt.style.use('seaborn-v0_8')  # Set plot style
    fig, axes = plt.subplots(3, 3, figsize=(20, 15))  # Create a 3x3 grid for subplots
    fig.suptitle('Loan Default Risk Analysis - Key Insights', fontsize=16, fontweight='bold')  # Add title

    # 1. Target distribution
    axes[0,0].pie(df['TARGET'].value_counts(), labels=['No Default', 'Default'],
                  autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])  # Pie chart for target distribution
    axes[0,0].set_title('Loan Default Distribution')

    # 2. Default rate by income category
    if 'INCOME_CATEGORY' in df.columns:  # Check if income category exists
        default_by_income = df.groupby('INCOME_CATEGORY')['TARGET'].mean()  # Default rate by income category
        axes[0,1].bar(default_by_income.index, default_by_income.values, color='skyblue')  # Bar chart
        axes[0,1].set_title('Default Rate by Income Category')
        axes[0,1].set_ylabel('Default Rate')

    # 3. Credit-to-Income ratio distribution
    axes[0,2].hist([df[df['TARGET']==0]['CREDIT_INCOME_RATIO'],
                    df[df['TARGET']==1]['CREDIT_INCOME_RATIO']],
                   bins=50, alpha=0.7, label=['No Default', 'Default'])  # Histograms for credit-to-income ratio
    axes[0,2].set_title('Credit-to-Income Ratio Distribution')
    axes[0,2].legend()

    # 4. Age distribution by default
    axes[1,0].boxplot([df[df['TARGET']==0]['AGE_YEARS'],
                       df[df['TARGET']==1]['AGE_YEARS']],
                      labels=['No Default', 'Default'])  # Boxplot for age by default status
    axes[1,0].set_title('Age Distribution by Default Status')
    axes[1,0].set_ylabel('Age (Years)')

    # 5. Employment years vs default
    axes[1,1].boxplot([df[df['TARGET']==0]['EMPLOYMENT_YEARS'],
                       df[df['TARGET']==1]['EMPLOYMENT_YEARS']],
                      labels=['No Default', 'Default'])  # Boxplot for employment years by default status
    axes[1,1].set_title('Employment Years by Default Status')
    axes[1,1].set_ylabel('Employment Years')

    # 6. External source scores
    if 'EXT_SOURCE_MEAN' in df.columns:  # Check if external source feature exists
        axes[1,2].scatter(df['EXT_SOURCE_MEAN'], df['TARGET'], alpha=0.5)  # Scatter plot for external source scores
        axes[1,2].set_title('External Source Scores vs Default')
        axes[1,2].set_xlabel('External Source Mean')
        axes[1,2].set_ylabel('Default (0/1)')

    # 7. Default by education type
    if 'NAME_EDUCATION_TYPE' in df.columns:  # Check if education type exists
        education_default = df.groupby('NAME_EDUCATION_TYPE')['TARGET'].mean().sort_values()  # Default rate by education
        axes[2,0].barh(education_default.index, education_default.values)  # Horizontal bar chart
        axes[2,0].set_title('Default Rate by Education Level')
        axes[2,0].set_xlabel('Default Rate')

    # 8. Default by family status
    if 'NAME_FAMILY_STATUS' in df.columns:  # Check if family status exists
        family_default = df.groupby('NAME_FAMILY_STATUS')['TARGET'].mean()  # Default rate by family status
        axes[2,1].bar(family_default.index, family_default.values, color='orange')  # Bar chart for family status
        axes[2,1].set_title('Default Rate by Family Status')
        axes[2,1].tick_params(axis='x', rotation=45)  # Rotate x-axis labels for better readability

    # 9. Credit amount vs income
    axes[2,2].scatter(df['AMT_INCOME_TOTAL'], df['AMT_CREDIT'],
                      c=df['TARGET'], alpha=0.6, cmap='RdYlGn_r')  # Scatter plot for income vs credit amount
    axes[2,2].set_title('Credit Amount vs Income (Color = Default Risk)')
    axes[2,2].set_xlabel('Income')
    axes[2,2].set_ylabel('Credit Amount')

    plt.tight_layout()  # Adjust spacing to avoid overlap
    plt.show()  # Display all the plots

    # Correlation analysis
    print("\n=== KEY RISK FACTORS CORRELATION WITH DEFAULT ===")
    risk_features = ['CREDIT_INCOME_RATIO', 'ANNUITY_INCOME_RATIO', 'AGE_YEARS',
                     'EMPLOYMENT_YEARS', 'EXT_SOURCE_MEAN', 'CONTACT_INFO_SCORE',
                     'DOCUMENTS_SUBMITTED', 'CNT_CHILDREN']

    # Calculate correlation with target (default risk)
    correlation_data = []
    for feature in risk_features:
        if feature in df.columns:
            corr = df[feature].corr(df['TARGET'])
            correlation_data.append({'Feature': feature, 'Correlation with Default': corr})

    # Display the correlation table sorted by absolute value
    corr_df = pd.DataFrame(correlation_data).sort_values('Correlation with Default', key=abs, ascending=False)
    print(corr_df)

# Run comprehensive EDA
comprehensive_eda(df_processed)


Output:

Feature | Correlation with Default
EXT_SOURCE_MEAN | -0.218922
AGE_YEARS | -0.077818
EMPLOYMENT_YEARS | -0.047370
CNT_CHILDREN | 0.022189
ANNUITY_INCOME_RATIO | 0.019341
DOCUMENTS_SUBMITTED | 0.016184
CONTACT_INFO_SCORE | 0.012431
CREDIT_INCOME_RATIO | -0.006601

Conclusion: External credit scores and applicant age show the strongest negative correlation with loan default risk.
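If you want the same correlation check across every numeric column rather than only the hand-picked risk features, a one-cell sketch:

# Correlation of every numeric feature with the default flag, strongest (by magnitude) first
numeric_corr = (df_processed.select_dtypes(include='number')
                .corrwith(df_processed['TARGET'])
                .sort_values(key=abs, ascending=False))
print(numeric_corr.head(15))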

Step 5: Model Training and Evaluation for Credit Risk Prediction

What we do in this step:

  • Select and encode features from the cleaned dataset.
  • Scale numerical values where needed.
  • Train five machine learning models:
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Gradient Boosting
    • XGBoost
  • Evaluate each using the AUC score and cross-validation.
  • Print classification reports for model comparison.

Here is the code:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
import xgboost as xgb
import numpy as np

def prepare_modeling_data(df):
    # Remove ID and target columns from features
    exclude_cols = ['SK_ID_CURR', 'TARGET']

    # Select numeric features
    numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
    numerical_features = [col for col in numerical_features if col not in exclude_cols]

    # Select categorical features and encode them
    categorical_features = df.select_dtypes(include=['object']).columns.tolist()
    df_model = df.copy()
    label_encoders = {}

    for col in categorical_features:
        le = LabelEncoder()
        df_model[col + '_encoded'] = le.fit_transform(df_model[col].astype(str))
        label_encoders[col] = le
        numerical_features.append(col + '_encoded')

    # Final feature set and target
    X = df_model[numerical_features]
    y = df_model['TARGET']

    return X, y, numerical_features, label_encoders

def train_credit_scoring_models(X, y):
    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Scale for models that need it
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
    }

    results = {}
    model_objects = {}

    print("=== CREDIT SCORING MODEL PERFORMANCE ===\n")

    for name, model in models.items():
        print(f"Training {name}...")

        # Use scaled features for logistic regression
        if name == 'Logistic Regression':
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
            y_proba = model.predict_proba(X_test_scaled)[:, 1]
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_proba = model.predict_proba(X_test)[:, 1]

        auc = roc_auc_score(y_test, y_proba)
        cv_scores = cross_val_score(
            model,
            X_train_scaled if name == 'Logistic Regression' else X_train,
            y_train,
            cv=5,
            scoring='roc_auc'
        )

        results[name] = {
            'AUC': auc,
            'CV_AUC_Mean': cv_scores.mean(),
            'CV_AUC_Std': cv_scores.std(),
            'Predictions': y_pred,
            'Probabilities': y_proba
        }
        model_objects[name] = model

        print(f"{name}:")
        print(f"  AUC Score: {auc:.4f}")
        print(f"  CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
        print("  Classification Report:")
        print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
        print("-" * 50)

    return results, model_objects, X_test, y_test, scaler

# Run the model pipeline
X, y, feature_names, encoders = prepare_modeling_data(df_processed)
model_results, trained_models, X_test, y_test, scaler = train_credit_scoring_models(X, y)


Output:

MODEL PERFORMANCE COMPARISON

Model | AUC Score | CV AUC (±2*Std) | Precision (Default) | Recall (Default) | F1-Score (Default)
Logistic Regression | 0.7467 | 0.7400 (±0.0072) | 0.65 | 0.01 | 0.03
Decision Tree | 0.6871 | 0.6787 (±0.0146) | 0.20 | 0.04 | 0.06
Random Forest | 0.7164 | 0.7041 (±0.0388) | 0.67 | 0.01 | 0.01
Gradient Boosting | 0.7551 | 0.7440 (±0.0222) | 0.59 | 0.02 | 0.04
XGBoost | 0.7172 | 0.7098 (±0.0248) | 0.42 | 0.05 | 0.10

Conclusion:

Gradient Boosting achieved the best overall AUC score and cross-validation performance, but all models struggled to recall defaulters due to severe class imbalance.
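A natural follow-up, not part of the original run, is to weight the minority class so the models pay more attention to defaulters. A minimal sketch using XGBoost's scale_pos_weight (treat the settings as assumptions to tune):

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
import xgboost as xgb

# Re-split the data (the earlier split happens inside train_credit_scoring_models)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Weight defaulters more heavily: negatives / positives is a common scale_pos_weight heuristic
pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()

xgb_weighted = xgb.XGBClassifier(n_estimators=100, scale_pos_weight=pos_weight,
                                 eval_metric='logloss', random_state=42)
xgb_weighted.fit(X_tr, y_tr)

# Recall on the Default class should improve, usually at the cost of some precision
y_proba = xgb_weighted.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, y_proba))
print(classification_report(y_te, xgb_weighted.predict(X_te), target_names=['No Default', 'Default']))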

Step 6: Comprehensive Model Evaluation

In this step, we evaluate the performance of different credit scoring models using various metrics and visualizations. This helps identify the best-performing model and the most important features contributing to credit default prediction.

What we do:

  • Compare AUC scores of all models in a bar chart
  • Plot ROC curves to visualize model discrimination ability
  • Visualize cross-validation AUC scores with error bars
  • Display confusion matrix for the best-performing model
  • Plot the top 20 feature importances for the best tree-based model
  • Print top 10 risk factors

Here is the code:



from sklearn.metrics import roc_curve, confusion_matrix  # needed for the ROC and confusion matrix plots

def comprehensive_model_evaluation(results, X_test, y_test):
    """
    Comprehensive evaluation of credit scoring models.
    """
    # Create 2x2 plot layout
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Credit Scoring Model Evaluation', fontsize=16, fontweight='bold')

    # AUC Comparison Bar Plot
    model_names = list(results.keys())
    auc_scores = [results[name]['AUC'] for name in model_names]
    axes[0, 0].bar(model_names, auc_scores, color='skyblue')
    axes[0, 0].set_title('Model AUC Comparison')
    axes[0, 0].set_ylabel('AUC Score')
    axes[0, 0].tick_params(axis='x', rotation=45)
    axes[0, 0].set_ylim(0.5, 1.0)
    for i, v in enumerate(auc_scores):
        axes[0, 0].text(i, v + 0.01, f'{v:.3f}', ha='center')

    # ROC Curve Plot
    for name in model_names:
        fpr, tpr, _ = roc_curve(y_test, results[name]['Probabilities'])
        axes[0, 1].plot(fpr, tpr, label=f"{name} (AUC = {results[name]['AUC']:.3f})")
    axes[0, 1].plot([0, 1], [0, 1], 'k--', label='Random')
    axes[0, 1].set_xlabel('False Positive Rate')
    axes[0, 1].set_ylabel('True Positive Rate')
    axes[0, 1].set_title('ROC Curves Comparison')
    axes[0, 1].legend()
    axes[0, 1].grid(True)

    # Cross-Validation AUC with Error Bars
    cv_means = [results[name]['CV_AUC_Mean'] for name in model_names]
    cv_stds = [results[name]['CV_AUC_Std'] for name in model_names]
    axes[1, 0].bar(model_names, cv_means, yerr=cv_stds, capsize=5, color='lightgreen')
    axes[1, 0].set_title('Cross-Validation AUC Scores')
    axes[1, 0].set_ylabel('CV AUC Score')
    axes[1, 0].tick_params(axis='x', rotation=45)

    # Confusion Matrix for Best Model
    best_model = max(results.keys(), key=lambda x: results[x]['AUC'])
    cm = confusion_matrix(y_test, results[best_model]['Predictions'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 1])
    axes[1, 1].set_title(f'Confusion Matrix - {best_model}')
    axes[1, 1].set_xlabel('Predicted')
    axes[1, 1].set_ylabel('Actual')

    plt.tight_layout()
    plt.show()

    # Feature Importance Plot for Best Tree-Based Model
    # (trained_models and feature_names come from the globals created in Step 5)
    tree_models = ['Random Forest', 'Gradient Boosting', 'XGBoost']
    best_tree_model = None
    best_tree_auc = 0
    for model in tree_models:
        if model in results and results[model]['AUC'] > best_tree_auc:
            best_tree_model = model
            best_tree_auc = results[model]['AUC']

    if best_tree_model:
        plot_feature_importance(trained_models[best_tree_model], feature_names, best_tree_model)

def plot_feature_importance(model, feature_names, model_name):
    """
    Plot top 20 feature importances for tree-based models.
    """
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        indices = np.argsort(importance)[::-1][:20]

        plt.figure(figsize=(12, 8))
        plt.title(f'Top 20 Feature Importances - {model_name}')
        plt.bar(range(20), importance[indices])
        plt.xticks(range(20), [feature_names[i] for i in indices], rotation=45, ha='right')
        plt.ylabel('Feature Importance')
        plt.tight_layout()
        plt.show()

        print(f"\n=== TOP 10 RISK FACTORS ({model_name}) ===")
        for i in range(10):
            print(f"{i+1}. {feature_names[indices[i]]}: {importance[indices[i]]:.4f}")

# Call this function to run evaluation
comprehensive_model_evaluation(model_results, X_test, y_test)


Output: 


Top 10 Risk Factors

Rank | Feature | Importance Score
1 | EXT_SOURCE_MEAN | 0.4746
2 | CREDIT_GOODS_RATIO | 0.0581
3 | EXT_SOURCE_MIN | 0.0410
4 | EXT_SOURCE_MAX | 0.0394
5 | EXT_SOURCE_3 | 0.0359
6 | EXT_SOURCE_1 | 0.0273
7 | EXT_SOURCE_2 | 0.0245
8 | DAYS_BIRTH | 0.0214
9 | AMT_ANNUITY | 0.0197
10 | NAME_EDUCATION_TYPE_encoded | 0.0185

Conclusion: External credit scores (EXT_SOURCE features) and credit-to-goods ratio are the most influential predictors of default risk in the Gradient Boosting model.
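Impurity-based importances from tree ensembles can favour high-cardinality features, so as an optional cross-check you could compute permutation importance on the held-out set (a sketch; n_repeats is kept small because it is slow on 100+ features):

from sklearn.inspection import permutation_importance

# Cross-check the impurity-based ranking with permutation importance on the test split
tree_candidates = ['Random Forest', 'Gradient Boosting', 'XGBoost']
best_name = max(tree_candidates, key=lambda m: model_results[m]['AUC'])

perm = permutation_importance(trained_models[best_name], X_test, y_test,
                              scoring='roc_auc', n_repeats=3, random_state=42)

top = perm.importances_mean.argsort()[::-1][:10]
for idx in top:
    print(f"{feature_names[idx]}: {perm.importances_mean[idx]:.4f}")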

Step 7: Create a Credit Scoring System

In this step, we will build a simple credit scoring system to check new loan applicants based on their risk of default. Here's what we do:

  • Wrap the trained model in a CreditScoringSystem class.
  • Accept applicant input and preprocess it to match the model's expected features. 
  • Predict the probability of default using the model. 
  • Convert that probability into a credit score ranging from 150 (high risk) to 1000 (low risk). 
  • Assign a risk category and loan recommendation based on the score.

This system helps automate and standardize the credit decision process.
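The probability-to-score conversion used in the class below is a straight linear rescaling; this tiny sketch shows how a few probabilities map to scores:

# How the linear mapping behaves at a few probabilities (same formula as the class below)
for p in (0.05, 0.20, 0.50, 0.90):
    score = int((1 - p) * 850 + 150)
    print(f"default probability {p:.0%} -> credit score {score}")
# 5% -> 957, 20% -> 830, 50% -> 575, 90% -> 235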

Here is the code:

def create_credit_scoring_system(best_model, scaler, encoders, feature_names):
    """
    Create a practical credit scoring system
    """
    class CreditScoringSystem:
        def __init__(self, model, scaler, encoders, feature_names):
            self.model = model
            self.scaler = scaler
            self.encoders = encoders
            self.feature_names = feature_names

        def calculate_credit_score(self, applicant_data):
            """
            Calculate credit score for a new applicant
            """
            # Preprocess input applicant data
            processed_data = self.preprocess_applicant_data(applicant_data)

            # Predict default probability
            default_probability = self.model.predict_proba(processed_data)[0][1]

            # Convert to credit score: 150 (risky) to 1000 (safe)
            credit_score = int((1 - default_probability) * 850 + 150)

            # Define risk level and recommendation
            if credit_score >= 750:
                risk_category = "Low Risk"
                recommendation = "Approve"
            elif credit_score >= 650:
                risk_category = "Medium Risk"
                recommendation = "Review Required"
            else:
                risk_category = "High Risk"
                recommendation = "Reject"

            return {
                'credit_score': credit_score,
                'default_probability': f"{default_probability:.1%}",
                'risk_category': risk_category,
                'recommendation': recommendation
            }

        def preprocess_applicant_data(self, data):
            """
            Preprocess applicant data for scoring
            (You can expand this to match your full pipeline)
            """
            # Placeholder: generate dummy values for now
            processed_features = np.zeros((1, len(self.feature_names)))
            processed_features[0] = np.random.random(len(self.feature_names))
            return processed_features

    return CreditScoringSystem(best_model, scaler, encoders, feature_names)

# Get best model based on AUC
best_model_name = max(model_results.keys(), key=lambda x: model_results[x]['AUC'])

# Initialize credit scoring system
scoring_system = create_credit_scoring_system(
    trained_models[best_model_name], scaler, encoders, feature_names
)

# Demo: Predict score for a sample applicant
print("=== CREDIT SCORING SYSTEM DEMO ===")
example_applicant = {
    'AMT_INCOME_TOTAL': 180000,
    'AMT_CREDIT': 450000,
    'AMT_ANNUITY': 25000,
    'CODE_GENDER': 'M',
    'FLAG_OWN_CAR': 'Y',
    'CNT_CHILDREN': 1
}

# Run scoring
score_result = scoring_system.calculate_credit_score(example_applicant)

# Show results
print(f"Credit Score: {score_result['credit_score']}")
print(f"Default Probability: {score_result['default_probability']}")
print(f"Risk Category: {score_result['risk_category']}")
print(f"Recommendation: {score_result['recommendation']}")


Output:

CREDIT SCORING SYSTEM DEMO

Credit Score: 150

Default Probability: 99.9%

Risk Category: High Risk

Recommendation: Reject

Conclusion: In this demo run, the applicant received a credit score of 150, corresponding to a predicted default probability of 99.9%, so the system classified them as High Risk and recommended rejecting the loan. Note that preprocess_applicant_data currently fills the feature vector with random placeholder values, so the exact numbers will change on every run; wire in your real preprocessing pipeline to score applicants meaningfully.
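For reference, here is a minimal sketch of what a real preprocess_applicant_data could look like; the function name preprocess_applicant_real and the train_medians argument are illustrative assumptions, not part of the original notebook:

import pandas as pd

def preprocess_applicant_real(applicant_data, feature_names, train_medians):
    """Hypothetical replacement for preprocess_applicant_data.

    applicant_data : dict of raw fields for one applicant
    feature_names  : columns the model was trained on (from prepare_modeling_data)
    train_medians  : per-column medians of the training features, e.g. X.median()
    """
    row = pd.DataFrame([applicant_data])

    # Recreate the engineered ratios used during training, where the raw inputs are present
    if {'AMT_CREDIT', 'AMT_INCOME_TOTAL'} <= set(row.columns):
        row['CREDIT_INCOME_RATIO'] = row['AMT_CREDIT'] / row['AMT_INCOME_TOTAL']
    if {'AMT_ANNUITY', 'AMT_INCOME_TOTAL'} <= set(row.columns):
        row['ANNUITY_INCOME_RATIO'] = row['AMT_ANNUITY'] / row['AMT_INCOME_TOTAL']

    # Align with the training columns; anything not supplied falls back to the training median.
    # Raw categorical fields (e.g. CODE_GENDER) would still need the saved LabelEncoders.
    row = row.reindex(columns=feature_names)
    row = row.fillna(train_medians)
    return row.values

# Example usage (train_medians would be computed once, right after prepare_modeling_data):
# train_medians = X.median()
# features = preprocess_applicant_real(example_applicant, feature_names, train_medians)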

Step 8: Generating the Loan Default Risk Analysis Report

In this final step, we summarize the entire project into a clear and actionable report:

  • Gives a quick overview of the dataset and default rates 
  • Highlights the top risk factors influencing loan defaults 
  • Compares model performance based on AUC scores 
  • Extracts insights from income, age, and credit behavior 
  • Shares practical recommendations for reducing loan risk

This step ties everything together and showcases your end-to-end workflow.

Here is the code: 

def generate_risk_analysis_report(df, model_results, feature_names):
    """
    Generate a complete risk analysis report for credit default prediction
    """
    print("=" * 80)
    print("               LOAN DEFAULT RISK ANALYSIS - FINAL REPORT")
    print("=" * 80)

    # 1. Dataset summary
    print(f"\n1. DATASET ANALYSIS:")
    print(f"   • Total Loan Applications: {len(df):,}")
    print(f"   • Default Rate: {df['TARGET'].mean():.2%}")
    print(f"   • Features Analyzed: {len(feature_names)}")

    # 2. Key financial risk factors based on correlation with default
    print(f"\n2. KEY FINANCIAL RISK INDICATORS:")
    risk_correlations = []
    risk_features = ['CREDIT_INCOME_RATIO', 'ANNUITY_INCOME_RATIO', 'AGE_YEARS',
                     'EMPLOYMENT_YEARS', 'EXT_SOURCE_MEAN']

    for feature in risk_features:
        if feature in df.columns:
            corr = df[feature].corr(df['TARGET'])
            risk_correlations.append((feature, corr))

    # Sort by strength of correlation
    risk_correlations.sort(key=lambda x: abs(x[1]), reverse=True)

    for feature, corr in risk_correlations[:5]:
        direction = "increases" if corr > 0 else "decreases"
        print(f"   • {feature}: {direction} default risk (correlation: {corr:.3f})")

    # 3. Model performance overview
    print(f"\n3. MODEL PERFORMANCE SUMMARY:")
    best_model = max(model_results.keys(), key=lambda x: model_results[x]['AUC'])
    print(f"   • Best Performing Model: {best_model}")
    print(f"   • AUC Score: {model_results[best_model]['AUC']:.4f}")
    print(f"   • Cross-Validation AUC: {model_results[best_model]['CV_AUC_Mean']:.4f}")

    # Ranking models by AUC
    print(f"\n   Model Ranking by AUC:")
    for i, (model, results) in enumerate(sorted(model_results.items(),
                                                key=lambda x: x[1]['AUC'], reverse=True), 1):
        print(f"   {i}. {model}: {results['AUC']:.4f}")

    # 4. Business insights based on segmentation
    print(f"\n4. BUSINESS INSIGHTS:")

    # Income-based default analysis
    high_income = df[df['AMT_INCOME_TOTAL'] > df['AMT_INCOME_TOTAL'].quantile(0.75)]
    low_income = df[df['AMT_INCOME_TOTAL'] <= df['AMT_INCOME_TOTAL'].quantile(0.25)]
    print(f"   • High Income Segment Default Rate: {high_income['TARGET'].mean():.2%}")
    print(f"   • Low Income Segment Default Rate: {low_income['TARGET'].mean():.2%}")

    # Credit-to-income ratio impact
    if 'CREDIT_INCOME_RATIO' in df.columns:
        high_leverage = df[df['CREDIT_INCOME_RATIO'] > 3]
        print(f"   • High Leverage (Credit/Income > 3x) Default Rate: {high_leverage['TARGET'].mean():.2%}")

    # Age-based analysis
    if 'AGE_YEARS' in df.columns:
        young_applicants = df[df['AGE_YEARS'] < 30]
        mature_applicants = df[df['AGE_YEARS'] >= 45]
        print(f"   • Young Applicants (<30) Default Rate: {young_applicants['TARGET'].mean():.2%}")
        print(f"   • Mature Applicants (45+) Default Rate: {mature_applicants['TARGET'].mean():.2%}")

    # 5. Recommendations based on findings
    print(f"\n5. RECOMMENDATIONS:")
    print("   • Implement stricter criteria for high credit-to-income ratio applications")
    print("   • Enhanced verification for applicants with limited credit history")
    print("   • Consider age-based risk adjustments in pricing models")
    print("   • Focus on external data sources for better risk assessment")
    print("   • Regular model retraining with updated data")

# Generate final report
generate_risk_analysis_report(df_processed, model_results, feature_names)

 

Output:

LOAN DEFAULT RISK ANALYSIS - FINAL REPORT

 

1. DATASET ANALYSIS:

   • Total Loan Applications: 65,980

   • Default Rate: 8.05%

   • Features Analyzed: 132

 

2. KEY FINANCIAL RISK INDICATORS:

   • EXT_SOURCE_MEAN: decreases default risk (correlation: -0.219)

   • AGE_YEARS: decreases default risk (correlation: -0.078)

   • EMPLOYMENT_YEARS: decreases default risk (correlation: -0.047)

   • ANNUITY_INCOME_RATIO: increases default risk (correlation: 0.019)

   • CREDIT_INCOME_RATIO: decreases default risk (correlation: -0.007)

 

3. MODEL PERFORMANCE SUMMARY:

   • Best Performing Model: Gradient Boosting

   • AUC Score: 0.7551

   • Cross-Validation AUC: 0.7440

 

   Model Ranking by AUC:

   1. Gradient Boosting: 0.7551

   2. Logistic Regression: 0.7467

   3. XGBoost: 0.7172

   4. Random Forest: 0.7164

   5. Decision Tree: 0.6871

 

4. BUSINESS INSIGHTS:

   • High Income Segment Default Rate: 6.74%

   • Low Income Segment Default Rate: 8.35%

   • High Leverage (Credit/Income > 3x) Default Rate: 8.04%

   • Young Applicants (<30) Default Rate: 11.40%

   • Mature Applicants (45+) Default Rate: 6.11%

 

Conclusion: The loan default risk analysis identified key financial indicators influencing default probability, with EXT_SOURCE_MEAN and AGE_YEARS significantly reducing risk. Gradient Boosting emerged as the top-performing model with an AUC of 0.7551. Default risk is higher among younger applicants and those with lower income, suggesting the need for targeted policies such as stricter approval for high-leverage cases and age-adjusted pricing.

 

Final Conclusion: What We Learned from the Loan Default Risk Prediction Project

This loan default risk prediction project provided a complete, practical walk-through of credit risk modeling in the financial domain. It not only highlighted key risk drivers but also demonstrated the end-to-end data science workflow with real-world relevance. Here's what we achieved:

Skills Demonstrated

  • Financial Feature Engineering: Created domain-specific ratios like CREDIT_INCOME_RATIO and ANNUITY_INCOME_RATIO, and derived variables like AGE_YEARS and EMPLOYMENT_YEARS to capture applicant risk more effectively.
  • Risk Modeling & Credit Scoring: Applied and compared multiple classification models (Logistic Regression, Decision Trees, Gradient Boosting, etc.), with Gradient Boosting achieving the best AUC of 0.7551.
  • Business Risk Insights: Identified patterns in default rates across income levels, age groups, and leverage segments to draw actionable recommendations for lending policies.
  • Model Evaluation: Used AUC scores and cross-validation to rank model performance and avoid overfitting, ensuring generalizable results.
  • Credit Scoring System: Developed a simplified, scalable scoring system that outputs credit score, default probability, risk category, and final lending decision for new applicants.
  • Data Storytelling & Reporting: Delivered an executive-style risk analysis report with KPIs, correlation insights, and clear business recommendations.

In short, this project covered the full credit risk modeling pipeline—from raw data to deployment-ready scoring—and serves as a strong portfolio example for aspiring data scientists in fintech or banking analytics.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1TM6mg0qMczKaJiVfvewkOolHrffKTwJ2?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the objective of this project from a business standpoint?

2. How does this project demonstrate real-world data science skills?

3. What kind of patterns or insights were discovered from the data?

4. Why is this project valuable from a financial analytics perspective?

5. How can this project be extended or applied in real systems?

