Customer Churn Prediction Project: From Data to Decisions
By Rohit Sharma
Updated on Jul 25, 2025 | 9 min read | 1.48K+ views
Customer churn has a direct impact on a business's growth and revenue. This project focuses on building a Customer Churn Prediction model using Python and machine learning to identify at-risk customers based on their behavior and transaction history.
You'll work with a real-world telecom subscription dataset to uncover the patterns that matter most for customer retention.
It's good to have some basic knowledge of Python, pandas, and core machine learning concepts before starting this Customer Churn Prediction project.
This project is best if you're comfortable with Python and want to apply machine learning to real customer data.
To build the Customer Churn Prediction project, we used the following tools and libraries:

| Tool / Library | Purpose |
| --- | --- |
| Python | Writing code to clean data, build churn models, and automate tasks |
| Pandas | Reading customer data, handling missing values, and feature engineering |
| NumPy | Performing fast mathematical operations and managing arrays |
| Seaborn | Visualizing churn patterns with plots such as heatmaps, bar charts, and distributions |
| Scikit-learn | Building and evaluating classification models such as logistic regression, decision trees, and random forests |
| Jupyter/Colab | Running code interactively, testing models, and visualizing outputs in one place |
Also Read - Step-by-Step Guide to Learning Python for Data Science
To predict which customers are likely to churn, you will apply machine learning techniques combined with behavioral insights.
Also Read - Clustering vs Classification: Difference Between Clustering & Classification
We’ll walk through each step to develop a churn prediction model from scratch:
Without any further delay, let’s get started!
Download the customer data from Kaggle by searching "Telco Customer Churn," downloading the ZIP file, extracting it, and using the CSV file for analysis.
Now that you've downloaded the dataset, let's move on to the next step: uploading and loading it into Google Colab.
Upload the extracted CSV file to Google Colab using the code below:
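# Open a file picker to upload the extracted CSV from your local machine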
from google.colab import files
uploaded = files.upload()
Once uploaded, import the required libraries and use the following Python code to read and check the data:
import pandas as pd
# Load the dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Show number of rows and columns
print(df.shape)
# Display first few rows of the data
print(df.head())
# Count how many customers churned vs stayed
print(df['Churn'].value_counts())
Output:
Dataset Overview

| Total Rows | Total Columns |
| --- | --- |
| 7043 | 21 |
Sample Rows from Dataset:

| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | InternetService | OnlineSecurity | ... | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | DSL | No | ... | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | DSL | Yes | ... | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | DSL | Yes | ... | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | DSL | Yes | ... | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | Fiber optic | No | ... | Yes |
Churn Distribution:

| Churn | Count |
| --- | --- |
| No | 5174 |
| Yes | 1869 |
Conclusion: Out of 7,043 customers, around 27% have churned, indicating a significant customer attrition problem.
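You can verify that figure directly from the value counts (a quick optional check):
# Churn rate from the raw 'Churn' column (still 'Yes'/'No' at this point)
print(round(df['Churn'].value_counts(normalize=True)['Yes'] * 100, 1))  # prints about 26.5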
Also Read - Top 6 Python IDEs of 2025 That Will Change Your Workflow!
To make the dataset clean and ready for modeling, we need to fix the TotalCharges column, which may contain non-numeric or missing entries.
Here is the code:
# Convert 'TotalCharges' column to numeric format and turn non-numeric values into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Replace missing values in 'TotalCharges' with the column's median
# (plain assignment avoids pandas' chained-assignment warning with inplace=True)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
Output:
All missing values in the TotalCharges column have been successfully filled with the median value, so the column is now fully numeric and ready for analysis or model training.
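To confirm the cleanup, you can run a quick sanity check on the column (optional):
# Verify 'TotalCharges' is numeric and has no remaining missing values
print(df['TotalCharges'].dtype)         # expected: float64
print(df['TotalCharges'].isna().sum())  # expected: 0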
This step prepares categorical data for modeling and gives a quick look at churn distribution.
Here is the code:
from sklearn.preprocessing import LabelEncoder
# Get all object-type columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()
# Remove 'customerID' and 'Churn' from encoding
categorical_cols.remove('customerID') # Unique, not useful for model
categorical_cols.remove('Churn') # Target variable, handled separately
# Encode categorical columns with LabelEncoder
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
# Convert 'Churn' to binary values: No → 0, Yes → 1
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})
import matplotlib.pyplot as plt
import seaborn as sns
# Plot churn count
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()
Output:
Conclusion: Categorical features are now numerically encoded, and the churn plot shows class imbalance with more non-churned customers than churned ones.
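To put a number on that imbalance, print the class proportions (a small optional check; note that 'Churn' is now 0/1 after the mapping above):
# Class balance as percentages: roughly 73% stayed (0) vs 27% churned (1)
print((df['Churn'].value_counts(normalize=True) * 100).round(1))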
Also Read - Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025
This step explores how tenure and monthly charges differ between churned and retained customers.
Here is the code:
# Compare customer tenure between churned and non-churned groups
sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure vs Churn')
plt.show()
# Compare monthly charges between churned and non-churned groups
sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges vs Churn')
plt.show()
Output:
Conclusion: Churned customers tend to have shorter tenure and higher monthly charges compared to those who stayed.
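To quantify what the boxplots suggest, you can compare the group medians directly (optional):
# Median tenure and monthly charges for retained (0) vs churned (1) customers
print(df.groupby('Churn')[['tenure', 'MonthlyCharges']].median())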
This step determines the strength of each feature's relationship to customer churn.
Here is the Code:
# Get correlation of all features with churn, sorted from high to low
corr = df.drop('customerID', axis=1).corr()['Churn'].sort_values(ascending=False)
print(corr)
Output:
| Feature | Correlation with Churn |
| --- | --- |
| Churn | 1.000 |
| MonthlyCharges | 0.193 |
| PaperlessBilling | 0.192 |
| SeniorCitizen | 0.151 |
| PaymentMethod | 0.107 |
| MultipleLines | 0.038 |
| PhoneService | 0.012 |
| gender | -0.009 |
| StreamingTV | -0.037 |
| StreamingMovies | -0.038 |
| InternetService | -0.047 |
| Partner | -0.150 |
| Dependents | -0.164 |
| DeviceProtection | -0.178 |
| OnlineBackup | -0.196 |
| TotalCharges | -0.199 |
| TechSupport | -0.282 |
| OnlineSecurity | -0.289 |
| tenure | -0.352 |
| Contract | -0.397 |
Conclusion: Customers with short tenure, month-to-month contracts, no online security or tech support, and higher monthly charges are more likely to churn.
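If you prefer a visual summary, the same corr series can be drawn as a horizontal bar chart (a minimal sketch reusing the variables computed above):
# Plot each feature's correlation with churn, dropping the trivial self-correlation
corr.drop('Churn').plot(kind='barh', figsize=(8, 6))
plt.title('Feature Correlation with Churn')
plt.tight_layout()
plt.show()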
Also Read - Correlation in Statistics: Definition, Types, Calculation, and Real-World Applications
This block of code separates the dataset into features and the target, then splits it into training and testing sets.
from sklearn.model_selection import train_test_split
# Define input features and target
features = df.columns.drop(['customerID', 'Churn'])
X = df[features]
y = df['Churn']
# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
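Because stratify=y was passed, the churn rate should be nearly identical in both splits; here is a quick optional check:
# Confirm stratification preserved the churn rate in both splits
print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Test churn rate: {y_test.mean():.3f}")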
We’ll use three popular classification models to predict customer churn: Logistic Regression, Decision Tree, and Random Forest. After training, we evaluate each model using precision, recall, F1-score, and ROC AUC.
Here is the code for evaluation:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("=== Logistic Regression ===")
print(classification_report(y_test, y_pred_lr))
print("ROC AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("\n")
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("=== Decision Tree ===")
print(classification_report(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]))
print("\n")
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("=== Random Forest ===")
print(classification_report(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
Output:

| Model | Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0 | 0.85 | 0.89 | 0.87 | 1035 |
| | 1 | 0.64 | 0.55 | 0.59 | 374 |
| | Accuracy | | | 0.80 | 1409 |
| | Macro Avg | 0.74 | 0.72 | 0.73 | |
| | Weighted Avg | 0.79 | 0.80 | 0.79 | |
| Decision Tree | 0 | 0.84 | 0.87 | 0.86 | 1035 |
| | 1 | 0.60 | 0.55 | 0.58 | 374 |
| | Accuracy | | | 0.78 | 1409 |
| | Macro Avg | 0.72 | 0.71 | 0.72 | |
| | Weighted Avg | 0.78 | 0.78 | 0.78 | |
| Random Forest | 0 | 0.83 | 0.90 | 0.86 | 1035 |
| | 1 | 0.64 | 0.50 | 0.56 | 374 |
| | Accuracy | | | 0.79 | 1409 |
| | Macro Avg | 0.74 | 0.70 | 0.71 | |
| | Weighted Avg | 0.78 | 0.79 | 0.78 | |
Conclusion: Logistic Regression performs best overall with 80% accuracy, slightly ahead of Random Forest (79%) and the Decision Tree (78%). All three models find the minority churn class harder to detect (recall of 0.50 to 0.55), consistent with the class imbalance observed earlier.
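To see where the best model goes wrong, you can inspect its confusion matrix (an optional sketch using scikit-learn's confusion_matrix with the logistic regression predictions from above):
from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted (0 = stayed, 1 = churned)
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Logistic Regression Confusion Matrix')
plt.show()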
Also Read - 45+ Key Interview Questions on Logistic Regression [Freshers & Experienced]
This step helps you understand which features had the most impact on churn prediction based on the Random Forest model.
Here is the code:
# Get and plot top 10 important features from the Random Forest model
importances = pd.Series(rf.feature_importances_, index=features)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.show()
Output:
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
This project helped us explore and predict customer churn using telecom data. We cleaned the dataset, handled missing values, and encoded categorical features to prepare it for modeling. Visualizations highlighted key patterns between customer behavior and churn.
We trained Logistic Regression, Decision Tree, and Random Forest models. Logistic Regression performed the best with 80% accuracy. Important churn indicators included contract type, tenure, and monthly charges, giving clear direction for targeted retention strategies.
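To turn the model into those retention decisions, you can score every customer with a churn probability and flag the riskiest ones. Below is a minimal sketch, where the 0.5 cutoff is an assumption you would tune to your retention budget:
# Score all customers with the trained logistic regression model
df['churn_probability'] = lr.predict_proba(X)[:, 1]
# Flag customers above an assumed 0.5 risk threshold, highest risk first
at_risk = df[df['churn_probability'] > 0.5].sort_values('churn_probability', ascending=False)
print(at_risk[['customerID', 'tenure', 'MonthlyCharges', 'churn_probability']].head())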
Colab Link:
https://colab.research.google.com/drive/1VRZ_tay9yEuSZLWf0VMuUyMsYxa-hz9Y