Customer Purchase Behavior Analysis Project Using Python
By Rohit Sharma
Updated on Jul 24, 2025 | 11 min read | 1.36K+ views
Customer Purchase Behavior Analysis helps businesses get a clear picture of how their customers shop. By breaking down the data, it becomes easier to spot loyal or big-spending customers, find patterns in what people often buy together, and understand what types of shoppers are most valuable to the business.
In this project, we use Python to perform RFM analysis and K‑Means clustering on an e-commerce store's data to learn these concepts.
It’s helpful to have some basic knowledge of the following before starting this project:
For this Customer Purchase Behavior Analysis project, the following tools and libraries will be used:
Tool / Library | Purpose |
Python | Programming language |
Google Colab | Cloud-based notebook for writing and running code |
Pandas | Data loading, cleaning, and manipulation |
NumPy | Efficient numerical operations |
Matplotlib / Seaborn | Data visualization |
Scikit‑learn | K‑Means clustering and feature scaling |
For our Customer Purchase Behavior Analysis, we’ll leverage three accessible yet powerful techniques:
You can complete this Customer Purchase Behavior Analysis project in about 2 to 3 hours. It’s ideal for beginners to intermediate users, offering hands‑on experience with customer segmentation and clustering using Python.
Let’s build this project from scratch with clear, step-by-step guidance:
With these steps, you’ll uncover actionable insights into customer behavior and build a robust analysis workflow using Python.
Without any further delay, let’s get started!
To analyze customer purchase behavior, we’ll use a customer shopping behavior dataset from Kaggle, saved as shopping_behavior_updated.csv (the file read by the loading code in this tutorial).
Follow the steps below to download the dataset:
Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.
Upload the downloaded CSV file to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
Once uploaded, use the following Python code to read and check the data:
# Step 1: Upload and Read the Dataset in Google Colab
import pandas as pd
df = pd.read_csv('shopping_behavior_updated.csv')
df.head() # Shows the first 5 rows of the data
Output:
Index | Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Payment Method | Frequency of Purchases |
0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly |
1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly |
2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually |
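Before moving on, it can help to check which columns are numeric and which hold text, since text columns such as Frequency of Purchases cannot be fed to numeric methods directly. The snippet below is a minimal sketch on a hand-made sample that mirrors a few of the columns above; the `sample` frame and its values are illustrative, not the real dataset.

```python
import pandas as pd

# Tiny hand-made sample mirroring a few columns of the dataset (illustrative values only)
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "Purchase Amount (USD)": [53, 64, 73],
    "Review Rating": [3.1, 3.1, 3.1],
    "Frequency of Purchases": ["Fortnightly", "Fortnightly", "Weekly"],
})

# Numeric columns can be used directly in calculations and clustering
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Text columns need to be encoded before a numeric method can use them
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

On the full dataset, the same two calls on `df` would list every column by type.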
To ensure clean and reliable analysis, we remove any rows with missing data:
Here is the code:
df.isnull().sum() # Check for any missing values in the data
# Remove rows that contain any missing (NaN) values
df = df.dropna()
Output
Column | Missing Values |
Customer ID | 0 |
Age | 0 |
Gender | 0 |
Item Purchased | 0 |
Category | 0 |
Purchase Amount (USD) | 0 |
Location | 0 |
Size | 0 |
Color | 0 |
Season | 0 |
Review Rating | 0 |
Subscription Status | 0 |
Shipping Type | 0 |
Discount Applied | 0 |
Promo Code Used | 0 |
Previous Purchases | 0 |
Payment Method | 0 |
Frequency of Purchases | 0 |
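The dataset here has no missing values, so `dropna()` removes nothing. To see what the two calls do when gaps exist, here is a small sketch on a made-up frame (`toy` and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up frame with two gaps: one missing amount and one missing category
toy = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Purchase Amount (USD)": [53.0, np.nan, 73.0, 90.0],
    "Category": ["Clothing", "Clothing", None, "Footwear"],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()

# Drop every row that has at least one missing value
cleaned = toy.dropna()

print(missing_per_column)
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Dropping rows is the simplest option; when many rows have gaps, filling values may be preferable, but that is outside the scope of this project.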
Understanding the distribution of customer spending is crucial before applying any clustering or segmentation techniques. Let’s begin by visualizing the Purchase Amount (USD).
Here is the code:
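# Import the plotting libraries used throughout the project
import matplotlib.pyplot as plt
import seaborn as sns
# Set the figure size for better visibility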
plt.figure(figsize=(8, 4))
# Plot a histogram of the 'Purchase Amount (USD)' column with KDE
# This helps us see how the data is distributed
sns.histplot(df['Purchase Amount (USD)'], bins=30, kde=True)
# Add a title and labels for clarity
plt.title('Purchase Amount Distribution')
plt.xlabel('Purchase Amount (USD)')
plt.ylabel('Frequency')
# Show the plot
plt.show()
Output:
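A histogram is easier to read alongside a numeric summary. As a rough sketch, the snippet below runs `describe()` on the five purchase amounts shown in the sample rows earlier; on the real data you would call it on `df['Purchase Amount (USD)']` instead.

```python
import pandas as pd

# The five purchase amounts from the sample rows shown earlier
amounts = pd.Series([53, 64, 73, 90, 49], name="Purchase Amount (USD)")

# count, mean, std, min, quartiles, and max in one call
stats = amounts.describe()
print(stats)
```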
To understand how spending differs across product categories, we’ll use a box plot. This helps identify:
What We’ll Do:
Here is the code:
# Create a box plot to compare spending across product categories
plt.figure(figsize=(10, 5))
sns.boxplot(x='Category', y='Purchase Amount (USD)', data=df) # Shows distribution per category
plt.title('Spending by Category')
plt.xlabel('Product Category')
plt.ylabel('Purchase Amount (USD)')
plt.grid(True)
plt.show()
Output:
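The box in each box plot spans the first to third quartile, with the median marked inside. The same numbers can be computed directly; the sketch below does so on a made-up frame (`toy` is an assumption for illustration, not the real data).

```python
import pandas as pd

# Made-up spending data for two categories
toy = pd.DataFrame({
    "Category": ["Clothing"] * 4 + ["Footwear"] * 4,
    "Purchase Amount (USD)": [20, 40, 60, 80, 30, 50, 70, 90],
})

# Quartiles per category: one row per category, one column per quantile
summary = (
    toy.groupby("Category")["Purchase Amount (USD)"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "Median", "Q3"]

# Interquartile range: the height of the box in the plot
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary)
```

Comparing medians and IQRs across categories gives a quick numeric check of what the plot shows visually.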
We’ll now apply K-Means Clustering to group customers based on their purchasing behavior. This helps identify distinct segments like high spenders, frequent buyers, or one-time shoppers.
What We’ll Do:
Here is the code:
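from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# 'Frequency of Purchases' holds text labels (e.g. 'Weekly'), so encode it as numbers first.
# Note: category codes follow alphabetical order, so they are an arbitrary encoding,
# not a true measure of how often a customer buys
df['Frequency Code'] = df['Frequency of Purchases'].astype('category').cat.codes
# Feature selection: purchase amount and encoded purchase frequency
X = df[['Purchase Amount (USD)', 'Frequency Code']]
# Scale the features, because KMeans is sensitive to feature magnitude
X_scaled = StandardScaler().fit_transform(X)
inertia = []
# Try clustering with K=1 to 10
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)  # Using the scaled version of X
    inertia.append(kmeans.inertia_)  # Save the sum of squared distances (inertia)
# Plot the inertia values to find the 'elbow'
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()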
Output:
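Inertia, the quantity plotted in the elbow chart, is the sum of squared distances from each point to its assigned cluster center. The sketch below checks this on synthetic data (the three blobs in `points` are made up for illustration) by recomputing the value by hand and comparing it with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(points)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centers) ** 2).sum()

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"model.inertia_:  {model.inertia_:.4f}")
```

Inertia generally falls as K grows, which is why the elbow method looks for the point where the drop flattens rather than for the minimum.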
Once we know the optimal number of clusters (say, 3 from the elbow method), we use K-Means to assign each customer to a segment.
What We’ll Do:
Here is the code:
# Apply KMeans clustering with 3 clusters (based on elbow method)
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled) # Assign cluster labels to each customer
# Plot the customer segments
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Purchase Amount (USD)',
                y='Frequency of Purchases',
                hue='Cluster',
                data=df,
                palette='Set1')
plt.title('K-Means Customer Segments')
plt.xlabel('Purchase Amount (USD)')
plt.ylabel('Frequency of Purchases')
plt.grid(True)
plt.show()
Output:
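Cluster labels are only useful once you know what each cluster looks like. One way to read them, sketched below on made-up data, is to average the original features per cluster and to convert the cluster centers back into original units with `inverse_transform`. The `toy` frame is hypothetical, and it uses the numeric Previous Purchases column as the second feature purely for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up customers: three low spenders and three high spenders
toy = pd.DataFrame({
    "Purchase Amount (USD)": [20, 25, 30, 90, 95, 100],
    "Previous Purchases": [2, 3, 4, 40, 45, 50],
})
features = ["Purchase Amount (USD)", "Previous Purchases"]

# Scale, cluster, and attach the labels to the original rows
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[features])
model = KMeans(n_clusters=2, random_state=42, n_init=10)
toy["Cluster"] = model.fit_predict(scaled)

# Average of each feature per cluster, in original units
profile = toy.groupby("Cluster")[features].mean()

# Cluster centers converted from scaled units back to original units
centers = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_), columns=features)

print(profile)
print(centers)
```

The two tables should agree, because each K-Means center is the mean of the points assigned to it.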
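In this step, we build RFM (Recency, Frequency, Monetary) features to spot high-value customers and those at risk of churn, helping businesses focus on retention and personalized marketing. This dataset has no transaction dates, so the code relies on proxy columns; in the sample output below, every customer ends up with the same Recency value, so Recency does not separate customers here.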
Here is the code:
# RFM Analysis: Create Recency, Frequency, and Monetary features for customer segmentation
# We use 'Purchase Amount (USD)' as the Monetary value (total spend),
# 'Previous Purchases' as a proxy for Frequency (average purchase activity),
# and count of 'Review Rating' entries as a proxy for Recency (how recently they engaged).
rfm_data = df.groupby('Customer ID').agg({
    'Purchase Amount (USD)': 'sum',  # Monetary: Total money spent by each customer
    'Previous Purchases': 'mean',    # Frequency: Average past purchases (proxy)
    'Review Rating': 'count'         # Recency proxy: Count of reviews (assumes more reviews = more recent activity)
}).reset_index()
# Rename the columns to standard RFM names
rfm_data.columns = ['Customer_ID', 'Monetary', 'Frequency', 'Recency']
# Invert Recency: Higher counts mean more recent activity,
# but in traditional RFM, *lower* recency values are better, so we invert it
rfm_data['Recency'] = rfm_data['Recency'].max() - rfm_data['Recency'] + 1
# Display sample and shape of the RFM dataset
print("RFM Data Sample:")
print(rfm_data.head())
print(f"\nRFM Data Shape: {rfm_data.shape}")
Output:
Customer_ID | Monetary | Frequency | Recency |
1 | 53 | 14.0 | 1 |
2 | 64 | 2.0 | 1 |
3 | 73 | 23.0 | 1 |
4 | 90 | 49.0 | 1 |
5 | 49 | 31.0 | 1 |
RFM Data Shape: (3900, 4)
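For comparison, here is how RFM is usually computed when a transaction log with dates is available. This is a minimal sketch on a hypothetical `orders` table; the column names `customer_id`, `order_date`, and `amount` are assumptions and do not exist in the dataset used in this project.

```python
import pandas as pd

# Hypothetical transaction log: one row per order
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2025-01-05", "2025-03-01", "2024-11-20",
        "2025-02-10", "2025-02-25", "2025-03-10",
    ]),
    "amount": [50.0, 70.0, 30.0, 20.0, 25.0, 40.0],
})

# Reference date: one day after the most recent order in the log
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    Recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=("order_date", "count"),                             # number of orders
    Monetary=("amount", "sum"),                                    # total spend
).reset_index()

print(rfm)
```

Here a lower Recency means a more recent purchase, which is why the scoring step assigns the highest Recency score to the lowest values.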
In this step, we assign scores to each customer based on Recency, Frequency, and Monetary metrics using quantile-based binning. These scores help segment customers into distinct behavioral groups for targeted strategies.
Here is the Code:
# Create RFM scores using quintiles with duplicate handling
# Use 'drop' to handle duplicate bin edges
try:
    # Method 1: Using qcut with duplicate handling
    rfm_data['R_Score'] = pd.qcut(rfm_data['Recency'], 5, labels=[5,4,3,2,1], duplicates='drop')
    rfm_data['F_Score'] = pd.qcut(rfm_data['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5], duplicates='drop')
    rfm_data['M_Score'] = pd.qcut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5], duplicates='drop')
except ValueError as e:
    print(f"qcut failed: {e}")
    print("Using cut method instead...")
    # Method 2: Alternative approach using cut with equal-width bins
    rfm_data['R_Score'] = pd.cut(rfm_data['Recency'], 5, labels=[5,4,3,2,1])
    rfm_data['F_Score'] = pd.cut(rfm_data['Frequency'], 5, labels=[1,2,3,4,5])
    rfm_data['M_Score'] = pd.cut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5])
# Convert scores to numeric (handle any NaN values)
rfm_data['R_Score'] = pd.to_numeric(rfm_data['R_Score'], errors='coerce')
rfm_data['F_Score'] = pd.to_numeric(rfm_data['F_Score'], errors='coerce')
rfm_data['M_Score'] = pd.to_numeric(rfm_data['M_Score'], errors='coerce')
# Fill any NaN values with the middle score (3)
rfm_data['R_Score'] = rfm_data['R_Score'].fillna(3)
rfm_data['F_Score'] = rfm_data['F_Score'].fillna(3)
rfm_data['M_Score'] = rfm_data['M_Score'].fillna(3)
# Create combined RFM score
rfm_data['RFM_Score'] = (rfm_data['R_Score'].astype(str) +
rfm_data['F_Score'].astype(str) +
rfm_data['M_Score'].astype(str))
print("RFM Scores Sample:")
print(rfm_data[['Customer_ID', 'R_Score', 'F_Score', 'M_Score', 'RFM_Score']].head())
Output:
Customer_ID | R_Score | F_Score | M_Score | RFM_Score |
1 | 3 | 2 | 3 | 323 |
2 | 3 | 1 | 3 | 313 |
3 | 3 | 3 | 4 | 334 |
4 | 3 | 5 | 5 | 355 |
5 | 3 | 4 | 2 | 342 |
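The two binning methods used above behave differently. `pd.qcut` aims for groups of equal size, while `pd.cut` splits the value range into equal-width intervals, so a single large value can push almost everyone into the lowest bin. The sketch below shows this on a made-up series with ties (`freq` is illustrative); ranking with `method='first'` breaks the ties so that `qcut` can form five equal groups.

```python
import pandas as pd

# Made-up frequency values with many ties and one large value
freq = pd.Series([1, 1, 1, 1, 2, 2, 3, 3, 3, 50])

# Quantile-based scores: rank first so tied values get distinct positions
quantile_scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Equal-width scores: the range 1 to 50 is split into five intervals of equal width
width_scores = pd.cut(freq, 5, labels=[1, 2, 3, 4, 5]).astype(int)

print(pd.DataFrame({"value": freq, "qcut_score": quantile_scores, "cut_score": width_scores}))
```

With quantile scoring, each score is assigned to two customers; with equal-width scoring, nine of the ten customers receive a score of 1.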
Now that we have calculated the RFM scores, we can categorize customers into meaningful segments such as Champions, Loyal Customers, At Risk, etc. This helps in tailoring marketing strategies for different customer groups.
Here is the code:
# Define customer segments based on RFM scores
def segment_customers(row):
    """Function to assign customer segments based on RFM scores"""
    score = str(row['R_Score']) + str(row['F_Score']) + str(row['M_Score'])
    # High value customers
    if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:
        return 'Champions'
    elif row['R_Score'] >= 3 and row['F_Score'] >= 3 and row['M_Score'] >= 3:
        return 'Loyal Customers'
    elif row['R_Score'] >= 3 and row['F_Score'] <= 2:
        return 'New Customers'
    elif row['R_Score'] <= 2 and row['F_Score'] >= 3 and row['M_Score'] >= 3:
        return 'At Risk'
    elif row['R_Score'] <= 2 and row['M_Score'] >= 4:
        return 'Cannot Lose Them'
    elif row['R_Score'] <= 2 and row['F_Score'] <= 2:
        return 'Hibernating'
    else:
        return 'Others'
# Apply segmentation
rfm_data['Segment'] = rfm_data.apply(segment_customers, axis=1)
# Display segment distribution
segment_counts = rfm_data['Segment'].value_counts()
print("Customer Segments Distribution:")
print(segment_counts)
Output:
Segment | Count |
New Customers | 1561 |
Loyal Customers | 1362 |
Others | 977 |
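Only three segments appear in this output. A likely reason is that the sample scores shown earlier all have an R_Score of 3, and with that value only the Loyal Customers, New Customers, and Others rules can fire. The sketch below reapplies the same rules to a few hand-picked score combinations (the `examples` frame is made up) to show which rule each one triggers.

```python
import pandas as pd

def segment_customers(row):
    """Same rules as above; the first matching rule wins."""
    if row["R_Score"] >= 4 and row["F_Score"] >= 4 and row["M_Score"] >= 4:
        return "Champions"
    elif row["R_Score"] >= 3 and row["F_Score"] >= 3 and row["M_Score"] >= 3:
        return "Loyal Customers"
    elif row["R_Score"] >= 3 and row["F_Score"] <= 2:
        return "New Customers"
    elif row["R_Score"] <= 2 and row["F_Score"] >= 3 and row["M_Score"] >= 3:
        return "At Risk"
    elif row["R_Score"] <= 2 and row["M_Score"] >= 4:
        return "Cannot Lose Them"
    elif row["R_Score"] <= 2 and row["F_Score"] <= 2:
        return "Hibernating"
    else:
        return "Others"

# Hand-picked score combinations, one per rule
examples = pd.DataFrame({
    "R_Score": [5, 3, 3, 3, 1, 1, 2],
    "F_Score": [5, 4, 1, 5, 4, 2, 1],
    "M_Score": [5, 3, 5, 1, 4, 5, 1],
})
examples["Segment"] = examples.apply(segment_customers, axis=1)
print(examples)
```

Because the rules are checked in order, a customer who matches several conditions is assigned to the first one that applies.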
To gain intuitive insights from the customer segmentation, let’s visualize the RFM scores and clusters. These plots will help identify high-value customers, dormant segments, and emerging trends in customer behavior.
Here is the code:
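# Assumption: the earlier steps never define rfm_data_clean, optimal_k, or a 'Cluster'
# column on the RFM table, so we create them here before plotting
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
rfm_data_clean = rfm_data.dropna().copy()
optimal_k = 3  # assumed to match the K used in the clustering step above
rfm_scaled = StandardScaler().fit_transform(rfm_data_clean[['Recency', 'Frequency', 'Monetary']])
rfm_data_clean['Cluster'] = KMeans(n_clusters=optimal_k, random_state=42).fit_predict(rfm_scaled)
# Create visualizations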
plt.figure(figsize=(15, 10))
# 1. RFM Score Distribution
plt.subplot(2, 3, 1)
plt.hist(rfm_data_clean['R_Score'].dropna(), bins=5, alpha=0.7, color='red', edgecolor='black')
plt.title('Recency Score Distribution')
plt.xlabel('Recency Score')
plt.ylabel('Frequency')
plt.subplot(2, 3, 2)
plt.hist(rfm_data_clean['F_Score'].dropna(), bins=5, alpha=0.7, color='green', edgecolor='black')
plt.title('Frequency Score Distribution')
plt.xlabel('Frequency Score')
plt.ylabel('Frequency')
plt.subplot(2, 3, 3)
plt.hist(rfm_data_clean['M_Score'].dropna(), bins=5, alpha=0.7, color='blue', edgecolor='black')
plt.title('Monetary Score Distribution')
plt.xlabel('Monetary Score')
plt.ylabel('Frequency')
# 2. Customer Segments Bar Plot
plt.subplot(2, 3, 4)
segment_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Customer Segments Distribution')
plt.xlabel('Segments')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
# 3. Cluster scatter plot
plt.subplot(2, 3, 5)
colors = ['red', 'blue', 'green', 'orange']
for i in range(optimal_k):
    cluster_data = rfm_data_clean[rfm_data_clean['Cluster'] == i]
    plt.scatter(cluster_data['Frequency'], cluster_data['Monetary'],
                c=colors[i], label=f'Cluster {i}', alpha=0.6)
plt.xlabel('Frequency')
plt.ylabel('Monetary')
plt.title('K-Means Clusters (Frequency vs Monetary)')
plt.legend()
# 4. RFM 3D scatter (using 2D projection)
plt.subplot(2, 3, 6)
plt.scatter(rfm_data_clean['Recency'], rfm_data_clean['Monetary'],
c=rfm_data_clean['Cluster'], cmap='viridis', alpha=0.6)
plt.xlabel('Recency')
plt.ylabel('Monetary')
plt.title('Clusters: Recency vs Monetary')
plt.colorbar()
plt.tight_layout()
plt.show()
# Save results
rfm_data_clean.to_csv('rfm_analysis_results.csv', index=False)
print("\n RFM analysis completed successfully!")
print("Results saved to 'rfm_analysis_results.csv'")
print(f"Total customers analyzed: {len(rfm_data_clean)}")
Output:
RFM analysis completed successfully!
Total customers analyzed: 3900
In this project, we analyzed customer purchase data using exploratory plots, K-Means clustering, and RFM (Recency, Frequency, Monetary) scoring. The segmentation rules define groups such as Champions, Loyal Customers, and At Risk; on this dataset, customers fell into New Customers, Loyal Customers, and Others. Combining EDA, scoring, and clustering in this way gives businesses an evidence-based starting point for marketing and retention decisions aimed at increasing customer lifetime value.
Colab Link:
https://colab.research.google.com/drive/1g2aPLZeAX4R13zaac0ANr-X88D-Sya5e?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...