Customer Churn Prediction Project: From Data to Decisions

By Rohit Sharma

Updated on Jul 25, 2025 | 9 min read | 1.48K+ views


Customer churn directly affects a business's growth and revenue. This project builds a customer churn prediction model with Python and machine learning to identify at-risk customers based on their behavior and transaction history.

You'll work with a real-world, subscription-based telecom dataset to uncover the patterns that matter for customer retention.


Before You Start: Skills You Should Know

It’s good to have some basic knowledge of the following before starting this Customer Churn Prediction project:

  • Python programming (variables, functions, loops, basic syntax)
  • Pandas and Numpy (for handling and analyzing data)
  • Matplotlib or Seaborn (for creating charts and visualizing trends)
  • Scikit-learn (for building and evaluating classification and regression models)
  • Business metrics like customer tenure, subscription type, and activity level
  • Handling real-world data (dealing with missing values, outliers, and encoding categorical features)


Project Duration and Learning Outcomes

  • Time required: Around 2 to 3 hours
  • Difficulty level: Moderate

This project is best if you're comfortable with Python and want to apply machine learning to real customer data. 

Tech Stack Used: 

For making the Customer Churn Prediction project, we used the following tools and libraries:

| Tool / Library | Purpose |
|---|---|
| Python | Writing code to clean data, build churn models, and automate tasks |
| Pandas | Reading customer data, handling missing values, and feature engineering |
| NumPy | Performing quick mathematical operations and managing arrays |
| Seaborn | Visualizing churn patterns with plots like heatmaps, bar charts, and distributions |
| Scikit-learn | Building and evaluating classification models like logistic regression, decision trees, and random forests |
| Jupyter/Colab | Running code interactively, testing models, and visualizing outputs in one place |


Models That Perform Churn Predictions

To predict which customers are likely to churn, you will apply machine learning techniques combined with behavioral insights. Here's what you have to focus on:

  • Classification Models – Use algorithms like logistic regression, decision trees, and random forests to classify whether a customer will churn.
  • Feature Engineering – Create features like usage frequency, subscription duration, and last activity date to improve model accuracy.
  • Behavioral Analysis – Identify key churn indicators such as a drop in engagement, billing issues, or a lack of recent activity.
  • Data Visualization – Build charts to track churn trends across subscription types, demographics, and customer segments.
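As a rough illustration of the feature engineering idea, the sketch below derives two extra columns from tenure and charges. The toy values and the feature names `avg_monthly_spend` and `is_new_customer` are assumptions made for this example; they are not part of the dataset or of the project's own code.

```python
import pandas as pd

# Toy rows that mimic three numeric columns of a churn dataset (illustrative values)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75],
})

# Hypothetical feature 1: average spend per month of tenure.
# Clipping tenure to at least 1 avoids dividing by zero for brand-new customers.
toy["avg_monthly_spend"] = toy["TotalCharges"] / toy["tenure"].clip(lower=1)

# Hypothetical feature 2: flag customers who are still in their first year
toy["is_new_customer"] = (toy["tenure"] < 12).astype(int)

print(toy)
```

Whether features like these actually help is something to check against the evaluation metrics rather than assume.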


From Data to Decisions: Creating a Customer Churn Model

We’ll walk through each step to develop a churn prediction model from scratch:

  • Load and Explore the Churn Dataset
  • Clean and Prepare Customer Data
  • Engineer Behavioral and Transactional Features
  • Visualize Churn Trends and Patterns
  • Train Classification Models (Logistic, Trees, etc.)
  • Evaluate Model Accuracy and Insights

Without any further delay, let’s get started!

Step 1: Download the Dataset

Download the customer data from Kaggle: search for "Customer Churn Prediction", download the ZIP file, extract it, and use the CSV file (WA_Fn-UseC_-Telco-Customer-Churn.csv) for analysis.

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Load and Understand the Churn Dataset in Google Colab

Upload the extracted CSV file to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, import the required libraries and use the following Python code to read and check the data:

import pandas as pd
# Load the dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Show number of rows and columns
print(df.shape)
# Display first few rows of the data
print(df.head())
# Count how many customers churned vs stayed
print(df['Churn'].value_counts())

Output : 

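Dataset Overview:

| Total Rows | Total Columns |
|---|---|
| 7043 | 21 |

Sample Rows from Dataset:

| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | InternetService | OnlineSecurity | ... | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | DSL | No | ... | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | DSL | Yes | ... | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | DSL | Yes | ... | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | DSL | Yes | ... | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | Fiber optic | No | ... | Yes |

Churn Distribution:

| Churn | Count |
|---|---|
| No | 5174 |
| Yes | 1869 |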


Conclusion: Out of 7,043 customers, 1,869 (about 27%) have churned. That is a substantial share of the customer base, and it also means the two classes are imbalanced.
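The churn percentage follows directly from the counts in the output. A minimal sketch of the arithmetic, using a small Series built from those two counts rather than the full DataFrame:

```python
import pandas as pd

# Class counts as reported in the Churn Distribution output
counts = pd.Series({"No": 5174, "Yes": 1869}, name="Churn")

total = counts.sum()                   # all customers
churn_rate = counts["Yes"] / total     # share of customers who churned

print(f"Total customers: {total}")
print(f"Churn rate: {churn_rate:.1%}")
```

On the loaded DataFrame, `df['Churn'].value_counts(normalize=True)` should give the same proportions in one call.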


Step 3: Handling Missing and Incorrect Values in TotalCharges

To make the dataset clean and ready for modeling, we need to fix the TotalCharges column, which may contain non-numeric or missing entries.

Here is the code:

# Convert 'TotalCharges' column to numeric format and turn non-numeric values into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Replace missing values in 'TotalCharges' with the column's median
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

Output: 

All missing values in the TotalCharges column have been successfully filled with the median value, so the column is now fully numeric and ready for analysis or model training.
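To see what the conversion does, here is a small sketch on a toy column. It assumes the problem entries are blank strings, which is a common reason a numeric column gets read as text; the values themselves are made up for illustration.

```python
import pandas as pd

# Toy column: numeric strings plus one blank entry (illustrative values)
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Fill the gap with the median of the valid values
filled = numeric.fillna(numeric.median())

print("Missing after conversion:", n_missing)
print(filled)
```

Checking `df['TotalCharges'].isna().sum()` before and after the fill is a quick way to confirm how many rows were affected in the real data.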

Step 4: Encoding Categorical Features and Visualizing Churn

This step prepares categorical data for modeling and gives a quick look at churn distribution.

Here is the code:

from sklearn.preprocessing import LabelEncoder
# Get all object-type columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()
# Remove 'customerID' and 'Churn' from encoding
categorical_cols.remove('customerID')  # Unique, not useful for model
categorical_cols.remove('Churn')       # Target variable, handled separately
# Encode categorical columns with LabelEncoder
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
# Convert 'Churn' to binary values: No → 0, Yes → 1
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})
import matplotlib.pyplot as plt
import seaborn as sns
# Plot churn count
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()

Output: 

Conclusion: Categorical features are now numerically encoded, and the churn plot shows class imbalance with more non-churned customers than churned ones.
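LabelEncoder assigns each category an integer code, which can suggest an ordering that does not exist for columns such as payment method. One alternative, which this project does not use, is one-hot encoding. The sketch below shows it on a toy frame; the rows are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy frame with one nominal column and one numeric column (illustrative values)
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check", "Credit card", "Mailed check"],
    "tenure": [1, 34, 2, 45],
})

# One-hot encode the nominal column: each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["PaymentMethod"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Tree-based models usually cope well with integer codes, while linear models such as logistic regression tend to be more sensitive to the implied ordering.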


Step 5: Visualizing Tenure and Monthly Charges by Churn Status

This step explores how tenure and monthly charges differ between churned and retained customers.

Here is the code:

# Compare customer tenure between churned and non-churned groups
sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure vs Churn')
plt.show()
# Compare monthly charges between churned and non-churned groups
sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges vs Churn')
plt.show()

Output:

Conclusion: Churned customers tend to have shorter tenure and higher monthly charges compared to those who stayed.
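The box plots can be backed up with a numeric summary. A minimal sketch, using a toy frame with made-up values in place of the real data, compares group medians:

```python
import pandas as pd

# Toy data: Churn is already encoded as 0 (stayed) / 1 (churned); values are illustrative
toy = pd.DataFrame({
    "Churn": [0, 0, 1, 0, 1, 1],
    "tenure": [34, 45, 2, 60, 5, 1],
    "MonthlyCharges": [56.95, 42.30, 70.70, 50.00, 99.65, 80.00],
})

# Median tenure and monthly charges for each churn group
summary = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(summary)
```

Running the same `groupby` on `df` gives the medians that the box plots draw as their center lines.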

Step 6: Correlation Analysis with Churn

This step determines the strength of each feature's relationship to customer churn.

Here is the Code:

# Get correlation of all features with churn, sorted from high to low
corr = df.drop('customerID', axis=1).corr()['Churn'].sort_values(ascending=False)
print(corr)

Output: 

| Feature | Correlation with Churn |
|---|---|
| Churn | 1.000 |
| MonthlyCharges | 0.193 |
| PaperlessBilling | 0.192 |
| SeniorCitizen | 0.151 |
| PaymentMethod | 0.107 |
| MultipleLines | 0.038 |
| PhoneService | 0.012 |
| gender | -0.009 |
| StreamingTV | -0.037 |
| StreamingMovies | -0.038 |
| InternetService | -0.047 |
| Partner | -0.150 |
| Dependents | -0.164 |
| DeviceProtection | -0.178 |
| OnlineBackup | -0.196 |
| TotalCharges | -0.199 |
| TechSupport | -0.282 |
| OnlineSecurity | -0.289 |
| tenure | -0.352 |
| Contract | -0.397 |

Conclusion: Customers with short tenure, month-to-month contracts, no online security or tech support, and higher monthly charges are more likely to churn.
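For label-encoded columns such as Contract, the correlation value depends on which integer each category received. A per-category churn rate avoids that dependence. The sketch below uses a toy frame with illustrative rows, not the real data:

```python
import pandas as pd

# Toy data: contract type as text, Churn as 0/1 (illustrative rows)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": [1, 1, 0, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the churn rate within each category
rate_by_contract = (
    toy.groupby("Contract")["Churn"].mean().sort_values(ascending=False)
)
print(rate_by_contract)
```

On the real data this needs the original category names, so it would have to run before the encoding step or on a copy of the raw column.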


Step 7: Splitting the Data for Training and Testing

This block of code separates the dataset into features and the target, then splits it into training and testing sets.

from sklearn.model_selection import train_test_split
# Define input features and target
features = df.columns.drop(['customerID', 'Churn'])
X = df[features]
y = df['Churn']
# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
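The `stratify=y` argument keeps the churn ratio roughly the same in both splits. A small sketch of that effect, using synthetic labels with the same class counts as the dataset and a placeholder feature column (both are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 5174 "No" (0) and 1869 "Yes" (1)
y = np.array([0] * 5174 + [1] * 1869)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# With stratification, the churn share should be nearly identical in both splits
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Test rows: {len(y_test)}")
print(f"Train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which makes test metrics harder to compare across runs.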

Step 8: Training and Evaluating ML Models for Churn Prediction

We’ll use three popular classification models to predict customer churn: Logistic Regression, Decision Tree, and Random Forest. After training, we evaluate each model using precision, recall, F1-score, and ROC AUC.

Here is the code for evaluation : 

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("=== Logistic Regression ===")
print(classification_report(y_test, y_pred_lr))
print("ROC AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("\n")
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("=== Decision Tree ===")
print(classification_report(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]))
print("\n")
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("=== Random Forest ===")
print(classification_report(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

Output:

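| Model | Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| Logistic Regression | 0 | 0.85 | 0.89 | 0.87 | 1035 |
| | 1 | 0.64 | 0.55 | 0.59 | 374 |
| | Accuracy | | | 0.80 | 1409 |
| | Macro Avg | 0.74 | 0.72 | 0.73 | |
| | Weighted Avg | 0.79 | 0.80 | 0.79 | |
| Decision Tree | 0 | 0.84 | 0.87 | 0.86 | 1035 |
| | 1 | 0.60 | 0.55 | 0.58 | 374 |
| | Accuracy | | | 0.78 | 1409 |
| | Macro Avg | 0.72 | 0.71 | 0.72 | |
| | Weighted Avg | 0.78 | 0.78 | 0.78 | |
| Random Forest | 0 | 0.83 | 0.90 | 0.86 | 1035 |
| | 1 | 0.64 | 0.50 | 0.56 | 374 |
| | Accuracy | | | 0.79 | 1409 |
| | Macro Avg | 0.74 | 0.70 | 0.71 | |
| | Weighted Avg | 0.78 | 0.79 | 0.78 | |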

Conclusion: 

  • Logistic Regression delivered the best F1-score for identifying churners (class 1) among the three models, with an overall accuracy of 80%.
  • Decision Tree scored slightly lower (F1-score of 0.58 for churners, 78% accuracy) but produces rules that are easy to interpret.
  • Random Forest showed strong precision and recall for non-churners but struggled more with churners, resulting in lower recall (0.50).
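Recall for churners stays between 0.50 and 0.55 across the three models. One common way to trade precision for recall, which is not part of the original project, is to lower the probability threshold used to flag a churner. The sketch below shows the idea on synthetic data from `make_classification`; the 0.35 threshold is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 27% positives) standing in for the churn dataset
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Default cut-off (0.5) versus a lower, illustrative cut-off (0.35)
pred_default = (proba >= 0.5).astype(int)
pred_lower = (proba >= 0.35).astype(int)

recall_default = recall_score(y_test, pred_default)
recall_lower = recall_score(y_test, pred_lower)
precision_default = precision_score(y_test, pred_default, zero_division=0)
precision_lower = precision_score(y_test, pred_lower, zero_division=0)

print(f"threshold 0.50: precision={precision_default:.2f}, recall={recall_default:.2f}")
print(f"threshold 0.35: precision={precision_lower:.2f}, recall={recall_lower:.2f}")
```

A lower threshold can only flag the same customers or more, so recall cannot drop, but precision usually does. The right balance depends on how a missed churner compares in cost to an unnecessary retention offer.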


Step 9: Top Features That Influence Churn

This step helps you understand which features had the most impact on churn prediction based on the Random Forest model.

Here is the code for the same :

# Get and plot top 10 important features from the Random Forest model
importances = pd.Series(rf.feature_importances_, index=features)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.show()

Output: 


Final Conclusion: What We Learned from the Customer Churn Prediction Project

This project helped us explore and predict customer churn using telecom data. We cleaned the dataset, handled missing values, and encoded categorical features to prepare it for modeling. Visualizations highlighted key patterns between customer behavior and churn.

We trained Logistic Regression, Decision Tree, and Random Forest models. Logistic Regression performed the best with 80% accuracy. Important churn indicators included contract type, tenure, and monthly charges, giving clear direction for targeted retention strategies.
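To turn predictions into a retention list, one option is to rank customers by predicted churn probability and contact the highest-risk ones first. The sketch below is a minimal illustration on synthetic data; the `CUST-` identifiers, the 400/100 split, and the top-10 cut-off are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer features and churn labels
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
ids = [f"CUST-{i:04d}" for i in range(len(y))]  # hypothetical customer IDs

# Fit on the first 400 rows and score the remaining 100 as "current" customers
model = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])

scores = pd.DataFrame({
    "customerID": ids[400:],
    "churn_probability": model.predict_proba(X[400:])[:, 1],
})

# Highest predicted risk first; the top rows are candidates for retention offers
top_at_risk = scores.sort_values("churn_probability", ascending=False).head(10)
print(top_at_risk)
```

In the project itself, the same ranking could be built from `lr.predict_proba(X_test)[:, 1]` joined back to the matching `customerID` values.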

Colab Link-
https://colab.research.google.com/drive/1VRZ_tay9yEuSZLWf0VMuUyMsYxa-hz9Y

