Customer Churn Prediction Project: From Data to Decisions
By Rohit Sharma
Updated on Jul 25, 2025 | 9 min read | 1.48K+ views
Customer churn has a direct impact on a business's growth and revenue. This project focuses on building a Customer Churn Prediction model using Python and machine learning to identify at-risk customers based on their behavior and transaction history.
You'll work with a real-world telecom subscription dataset to uncover the patterns that matter most for customer retention.
It's good to have some basic knowledge of Python, pandas, and core machine learning concepts before starting this Customer Churn Prediction project.
This project is best if you're comfortable with Python and want to apply machine learning to real customer data.
To build the Customer Churn Prediction project, we used the following tools and libraries:

| Tool / Library | Purpose |
| --- | --- |
| Python | Writing code to clean data, build churn models, and automate tasks |
| Pandas | Reading customer data, handling missing values, and feature engineering |
| NumPy | Performing fast mathematical operations and managing arrays |
| Seaborn | Visualizing churn patterns with plots such as heatmaps, bar charts, and distributions |
| Scikit-learn | Building and evaluating classification models such as logistic regression, decision trees, and random forests |
| Jupyter/Colab | Running code interactively, testing models, and visualizing outputs in one place |
Also Read - Step-by-Step Guide to Learning Python for Data Science
To predict which customers are likely to churn, you will apply machine learning techniques combined with behavioral insights.
Also Read - Clustering vs Classification: Difference Between Clustering & Classification
We’ll walk through each step to develop a churn prediction model from scratch:
Without any further delay, let’s get started!
Download the customer data from Kaggle by searching "Telco Customer Churn," downloading the ZIP file, extracting it, and using the CSV file for analysis.
Now that you've downloaded the dataset, let's move on to the next step: uploading and loading it into Google Colab.
Upload the extracted CSV file to Google Colab using the code below:
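# Open a file picker to upload the extracted CSV from your local machine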
from google.colab import files
uploaded = files.upload()
Once uploaded, import the required libraries and use the following Python code to read and check the data:
import pandas as pd
# Load the dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Show number of rows and columns
print(df.shape)
# Display first few rows of the data
print(df.head())
# Count how many customers churned vs stayed
print(df['Churn'].value_counts())
Output:
Dataset Overview

| Total Rows | Total Columns |
| --- | --- |
| 7043 | 21 |
Sample Rows from Dataset:

| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | InternetService | OnlineSecurity | ... | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | DSL | No | ... | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | DSL | Yes | ... | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | DSL | Yes | ... | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | DSL | Yes | ... | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | Fiber optic | No | ... | Yes |
Churn Distribution:

| Churn | Count |
| --- | --- |
| No | 5174 |
| Yes | 1869 |
Conclusion: Out of 7,043 customers, around 27% have churned, indicating a significant customer attrition problem.
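You can verify that figure directly from the value counts (a quick optional check):
# Churn rate from the raw 'Churn' column (still 'Yes'/'No' at this point)
print(round(df['Churn'].value_counts(normalize=True)['Yes'] * 100, 1))  # prints about 26.5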
Also Read - Top 6 Python IDEs of 2025 That Will Change Your Workflow!
To make the dataset clean and ready for modeling, we need to fix the TotalCharges column, which may contain non-numeric or missing entries.
Here is the code:
# Convert 'TotalCharges' column to numeric format and turn non-numeric values into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Replace missing values in 'TotalCharges' with the column's median
# (plain assignment avoids pandas' chained-assignment warning with inplace=True)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
Output:
All missing values in the TotalCharges column have been successfully filled with the median value, so the column is now fully numeric and ready for analysis or model training.
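To confirm the cleanup, you can run a quick sanity check on the column (optional):
# Verify 'TotalCharges' is numeric and has no remaining missing values
print(df['TotalCharges'].dtype)         # expected: float64
print(df['TotalCharges'].isna().sum())  # expected: 0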
This step prepares categorical data for modeling and gives a quick look at churn distribution.
Here is the code:
from sklearn.preprocessing import LabelEncoder
# Get all object-type columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()
# Remove 'customerID' and 'Churn' from encoding
categorical_cols.remove('customerID') # Unique, not useful for model
categorical_cols.remove('Churn') # Target variable, handled separately
# Encode categorical columns with LabelEncoder
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
# Convert 'Churn' to binary values: No → 0, Yes → 1
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})
import matplotlib.pyplot as plt
import seaborn as sns
# Plot churn count
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()
Output:
Conclusion: Categorical features are now numerically encoded, and the churn plot shows class imbalance with more non-churned customers than churned ones.
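To put a number on that imbalance, print the class proportions (a small optional check; note that 'Churn' is now 0/1 after the mapping above):
# Class balance as percentages: roughly 73% stayed (0) vs 27% churned (1)
print((df['Churn'].value_counts(normalize=True) * 100).round(1))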
Also Read - Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025
This step explores how tenure and monthly charges differ between churned and retained customers.
Here is the code:
# Compare customer tenure between churned and non-churned groups
sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure vs Churn')
plt.show()
# Compare monthly charges between churned and non-churned groups
sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges vs Churn')
plt.show()
Output:
Conclusion: Churned customers tend to have shorter tenure and higher monthly charges compared to those who stayed.
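To quantify what the boxplots suggest, you can compare the group medians directly (optional):
# Median tenure and monthly charges for retained (0) vs churned (1) customers
print(df.groupby('Churn')[['tenure', 'MonthlyCharges']].median())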
This step determines the strength of each feature's relationship to customer churn.
Here is the Code:
# Get correlation of all features with churn, sorted from high to low
corr = df.drop('customerID', axis=1).corr()['Churn'].sort_values(ascending=False)
print(corr)
Output:
| Feature | Correlation with Churn |
| --- | --- |
| Churn | 1.000 |
| MonthlyCharges | 0.193 |
| PaperlessBilling | 0.192 |
| SeniorCitizen | 0.151 |
| PaymentMethod | 0.107 |
| MultipleLines | 0.038 |
| PhoneService | 0.012 |
| gender | -0.009 |
| StreamingTV | -0.037 |
| StreamingMovies | -0.038 |
| InternetService | -0.047 |
| Partner | -0.150 |
| Dependents | -0.164 |
| DeviceProtection | -0.178 |
| OnlineBackup | -0.196 |
| TotalCharges | -0.199 |
| TechSupport | -0.282 |
| OnlineSecurity | -0.289 |
| tenure | -0.352 |
| Contract | -0.397 |
Conclusion: Customers with short tenure, month-to-month contracts, no online security or tech support, and higher monthly charges are more likely to churn.
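If you prefer a visual summary, the same corr series can be drawn as a horizontal bar chart (a minimal sketch reusing the variables computed above):
# Plot each feature's correlation with churn, dropping the trivial self-correlation
corr.drop('Churn').plot(kind='barh', figsize=(8, 6))
plt.title('Feature Correlation with Churn')
plt.tight_layout()
plt.show()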
Also Read - Correlation in Statistics: Definition, Types, Calculation, and Real-World Applications
This block of code separates the dataset into features and the target, then splits it into training and testing sets.
from sklearn.model_selection import train_test_split
# Define input features and target
features = df.columns.drop(['customerID', 'Churn'])
X = df[features]
y = df['Churn']
# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
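Because stratify=y was passed, the churn rate should be nearly identical in both splits; here is a quick optional check:
# Confirm stratification preserved the churn rate in both splits
print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Test churn rate: {y_test.mean():.3f}")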
We’ll use three popular classification models to predict customer churn: Logistic Regression, Decision Tree, and Random Forest. After training, we evaluate each model using precision, recall, F1-score, and ROC AUC.
Here is the code for evaluation:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("=== Logistic Regression ===")
print(classification_report(y_test, y_pred_lr))
print("ROC AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("\n")
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("=== Decision Tree ===")
print(classification_report(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]))
print("\n")
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("=== Random Forest ===")
print(classification_report(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
Output:

| Model | Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0 | 0.85 | 0.89 | 0.87 | 1035 |
| | 1 | 0.64 | 0.55 | 0.59 | 374 |
| | Accuracy | | | 0.80 | 1409 |
| | Macro Avg | 0.74 | 0.72 | 0.73 | |
| | Weighted Avg | 0.79 | 0.80 | 0.79 | |
| Decision Tree | 0 | 0.84 | 0.87 | 0.86 | 1035 |
| | 1 | 0.60 | 0.55 | 0.58 | 374 |
| | Accuracy | | | 0.78 | 1409 |
| | Macro Avg | 0.72 | 0.71 | 0.72 | |
| | Weighted Avg | 0.78 | 0.78 | 0.78 | |
| Random Forest | 0 | 0.83 | 0.90 | 0.86 | 1035 |
| | 1 | 0.64 | 0.50 | 0.56 | 374 |
| | Accuracy | | | 0.79 | 1409 |
| | Macro Avg | 0.74 | 0.70 | 0.71 | |
| | Weighted Avg | 0.78 | 0.79 | 0.78 | |
Conclusion: Logistic Regression performs best overall with 80% accuracy, slightly ahead of Random Forest (79%) and the Decision Tree (78%). All three models find the minority churn class harder to detect (recall of 0.50 to 0.55), consistent with the class imbalance observed earlier.
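To see where the best model goes wrong, you can inspect its confusion matrix (an optional sketch using scikit-learn's confusion_matrix with the logistic regression predictions from above):
from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted (0 = stayed, 1 = churned)
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Logistic Regression Confusion Matrix')
plt.show()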
Also Read - 45+ Key Interview Questions on Logistic Regression [Freshers & Experienced]
This step helps you understand which features had the most impact on churn prediction based on the Random Forest model.
Here is the code:
# Get and plot top 10 important features from the Random Forest model
importances = pd.Series(rf.feature_importances_, index=features)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.show()
Output:
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
This project helped us explore and predict customer churn using telecom data. We cleaned the dataset, handled missing values, and encoded categorical features to prepare it for modeling. Visualizations highlighted key patterns between customer behavior and churn.
We trained Logistic Regression, Decision Tree, and Random Forest models. Logistic Regression performed the best with 80% accuracy. Important churn indicators included contract type, tenure, and monthly charges, giving clear direction for targeted retention strategies.
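To turn the model into those retention decisions, you can score every customer with a churn probability and flag the riskiest ones. Below is a minimal sketch, where the 0.5 cutoff is an assumption you would tune to your retention budget:
# Score all customers with the trained logistic regression model
df['churn_probability'] = lr.predict_proba(X)[:, 1]
# Flag customers above an assumed 0.5 risk threshold, highest risk first
at_risk = df[df['churn_probability'] > 0.5].sort_values('churn_probability', ascending=False)
print(at_risk[['customerID', 'tenure', 'MonthlyCharges', 'churn_probability']].head())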
Colab Link:
https://colab.research.google.com/drive/1VRZ_tay9yEuSZLWf0VMuUyMsYxa-hz9Y