Customer Purchase Behavior Analysis Project Using Python

Q: 1. What is RFM analysis, and why is it important?

RFM (Recency, Frequency, Monetary) analysis is a marketing technique used to identify and segment customers based on their purchasing behavior. It helps businesses understand who their best customers are, who might churn, and where to focus retention or engagement efforts.

Q: 2. How were Recency, Frequency, and Monetary values calculated in this project?

Monetary: Total purchase amount per customer.
Frequency: Average of 'Previous Purchases' per customer.
Recency: Number of transactions (reviews) used as a proxy, inverted so lower means more recent activity.

Q: 3. Why did we invert the Recency values?

In traditional RFM analysis, a lower recency value means more recent activity, which is better. The raw transaction count runs the other way (a higher count is treated as more recent), so we inverted it to follow the same convention. The scoring step then gives the lowest recency values the highest R score.

Q: 4. How are RFM scores assigned to customers?

R, F, and M scores are generated using quintile-based binning (via qcut), where customers are divided into 5 bins and labeled 1 to 5. The scores are then concatenated to form a combined 3-digit RFM score.

Q: 5. What customer segments were identified using RFM scores?

Based on RFM combinations, customers were classified into segments such as:
- Champions
- Loyal Customers
- New Customers
- At Risk
- Hibernating
- Cannot Lose Them
- Others

Q: 6. What is the benefit of customer segmentation?

Segmentation helps businesses personalize marketing strategies, improve customer retention, increase engagement, and allocate resources effectively to high-value customers.

Q: 7. How was K-means clustering used in this project?

K-means clustering was applied to the scaled RFM features to automatically group customers into clusters based on behavioral similarity. The elbow method was used to determine the optimal number of clusters (k=4).

Q: 8. What visualizations were created to analyze customer data?

Several visualizations were included:
- Histograms for R, F, M score distribution
- Bar chart for customer segments
- Scatter plots for clusters based on Frequency, Monetary, and Recency

Q: 9. How is this analysis useful for marketing teams?

Marketing teams can use these insights to:
- Target “Champions” with exclusive offers
- Re-engage “At Risk” customers
- Welcome and nurture “New Customers”
- Reduce churn by understanding behavior patterns

Q: 10. Can this RFM model be improved or extended?

Yes. Improvements include:
- Adding time-decay weighting for recency
- Using demographic or behavioral variables
- Applying machine learning models for churn prediction or lifetime value forecasting

By Rohit Sharma

Updated on Jul 24, 2025 | 11 min read | 1.36K+ views

Customer Purchase Behavior Analysis helps businesses get a clear picture of how their customers shop. By breaking down the data, it becomes easier to spot loyal or big-spending customers, find patterns in what people often buy together, and understand what types of shoppers are most valuable to the business.

In this project, we use Python to perform RFM analysis and K‑Means clustering on an e-commerce store's data to learn these concepts.

What Should You Know Beforehand?

It’s helpful to have some basic knowledge of the following before starting this project:

Python programming (variables, functions, loops, basic syntax)
Pandas and NumPy (data loading, cleaning, and numerical operations)
Matplotlib or Seaborn (data visualization)
Scikit‑learn basics (model fitting and evaluation)
Fundamentals of clustering and association rules (conceptual knowledge)

Technologies and Libraries Used

For this Customer Purchase Behavior Analysis project, the following tools and libraries will be used:

Tool / Library	Purpose
Python	Programming language
Google Colab	Cloud-based notebook for writing and running code
Pandas	Data loading, cleaning, and manipulation
NumPy	Efficient numerical operations
Matplotlib / Seaborn	Data visualization
Scikit‑learn	Feature scaling and K‑Means clustering

Models That Will Be Utilized for Learning

For our Customer Purchase Behavior Analysis, we’ll leverage two accessible yet powerful techniques:

RFM Analysis
Segments customers based on Recency (how recently they purchased), Frequency (how often), and Monetary (how much they spent), helping identify high‑value and at‑risk groups.
K‑Means Clustering
An unsupervised algorithm that groups customers into clusters with similar RFM profiles, enabling targeted marketing and personalized campaigns.
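For reference, classic RFM is computed from a dated transaction log. The sketch below shows that calculation on a small made-up table. The column names (customer_id, order_date, amount) and the snapshot date are assumptions chosen for illustration; they are not columns of the dataset used in this project.

```python
import pandas as pd

# Hypothetical transaction log (illustrative values only)
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-25", "2024-03-10",
    ]),
    "amount": [40.0, 60.0, 25.0, 10.0, 30.0, 50.0],
})

# Reference date against which recency is measured (assumed)
snapshot = pd.Timestamp("2024-03-15")

rfm = tx.groupby("customer_id").agg(
    Recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=("order_date", "count"),                            # number of orders
    Monetary=("amount", "sum"),                                   # total spend
).reset_index()

print(rfm)
```

Here a smaller Recency is better (the customer bought more recently), while larger Frequency and Monetary values are better.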

Time Taken and Difficulty

You can complete this Customer Purchase Behavior Analysis project in about 2 to 3 hours. It’s ideal for beginners to intermediate users, offering hands‑on experience with customer segmentation and clustering using Python.

How to Build a Customer Purchase Behavior Analysis Model

Let’s build this project from scratch with clear, step-by-step guidance:

Downloading the Dataset
Uploading and Reading the Data
Cleaning and Preparing the Data
Exploratory Data Analysis (EDA)
Applying K‑Means Clustering
Running RFM Analysis and Segmenting Customers
Visualizing the Results

With these steps, you’ll uncover actionable insights into customer behavior and build a robust analysis workflow using Python.

Without any further delay, let’s get started!

Step 1: Download the Dataset

To analyze customer purchase behavior, we’ll use a consumer shopping behavior dataset hosted on Kaggle.

Follow the steps below to download the dataset:

Open a new tab in your web browser.
Go to: https://www.kaggle.com/code/xokent/consumer-behavior-and-shopping-habits-clustering/input
Click the Download button to download the dataset as a .zip file.
Once downloaded, extract the ZIP file. You’ll find a file named shopping_behavior_updated.csv.
We’ll use this CSV file for the project.

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data:

# Step 2: Read the dataset in Google Colab

import pandas as pd

df = pd.read_csv('shopping_behavior_updated.csv')

df.head()  # Shows the first 5 rows of the data

Output:

Columns in the dataset: Customer ID, Age, Gender, Item Purchased, Category, Purchase Amount (USD), Location, Size, Color, Season, Review Rating, Subscription Status, Shipping Type, Discount Applied, Promo Code Used, Previous Purchases, Payment Method, Frequency of Purchases

First five rows (selected columns):

Gender	Item Purchased	Category	Location	Color	Season	Review Rating	Subscription Status	Shipping Type	Payment Method	Frequency of Purchases
Male	Blouse	Clothing	Kentucky	Gray	Winter	3.1	Yes	Express	Venmo	Fortnightly
Male	Sweater	Clothing	Maine	Maroon	Winter	3.1	Yes	Express	Cash	Fortnightly
Male	Jeans	Clothing	Massachusetts	Maroon	Spring	3.1	Yes	Free Shipping	Credit Card	Weekly
Male	Sandals	Footwear	Rhode Island	Maroon	Spring	3.5	Yes	Next Day Air	PayPal	Weekly
Male	Blouse	Clothing	Oregon	Turquoise	Spring	2.7	Yes	Free Shipping	PayPal	Annually

Step 3: Clean and Prepare the Data

To ensure clean and reliable analysis, we check each column for missing values and remove any rows that contain them:

Here is the code:

# Check for missing values in each column
print(df.isnull().sum())

# Remove rows that contain any missing (NaN) values
df = df.dropna()

Output

	0
Customer ID	0
Age	0
Gender	0
Item Purchased	0
Category	0
Purchase Amount (USD)	0
Location	0
Size	0
Color	0
Season	0
Review Rating	0
Subscription Status	0
Shipping Type	0
Discount Applied	0
Promo Code Used	0
Previous Purchases	0
Payment Method	0
Frequency of Purchases	0
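Missing values are only one kind of data problem. As a minimal sketch, the helper below (cleaning_report is a hypothetical name, not part of the original notebook) also counts duplicate rows, repeated IDs, and non-positive amounts. It is shown on a tiny made-up frame with one problem of each kind.

```python
import pandas as pd

def cleaning_report(df, id_col, amount_col):
    """Summarize common data-quality issues before analysis."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "non_positive_amounts": int((df[amount_col] <= 0).sum()),
    }

# Made-up frame: one repeated row, one missing amount, one zero amount
demo = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3, 4],
    "Purchase Amount (USD)": [53.0, 64.0, 64.0, None, 0.0],
})
report = cleaning_report(demo, "Customer ID", "Purchase Amount (USD)")
print(report)
```

On the project data, the equivalent call would be cleaning_report(df, 'Customer ID', 'Purchase Amount (USD)').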

Step 4: Exploratory Data Analysis (EDA)

Understanding the distribution of customer spending is crucial before applying any clustering or segmentation techniques. Let’s begin by visualizing the Purchase Amount (USD).

Here is the code:

import matplotlib.pyplot as plt
import seaborn as sns

# Set the figure size for better visibility
plt.figure(figsize=(8, 4))

# Plot a histogram of the 'Purchase Amount (USD)' column with a KDE curve
# to see how the data is distributed
sns.histplot(df['Purchase Amount (USD)'], bins=30, kde=True)

# Add a title and labels for clarity
plt.title('Purchase Amount Distribution')
plt.xlabel('Purchase Amount (USD)')
plt.ylabel('Frequency')

# Show the plot
plt.show()

Output: [Figure: Purchase Amount Distribution, histogram with KDE curve]

Analyze Spending by Category

To understand how spending differs across product categories, we’ll use a box plot. This helps identify:

Which categories have the highest typical (median) spend
Categories with outliers or varied spending

What We’ll Do:

Compare purchase amounts across different product categories using a box plot

Here is the code:

# Create a box plot to compare spending across product categories

plt.figure(figsize=(10, 5))

sns.boxplot(x='Category', y='Purchase Amount (USD)', data=df)  # Shows distribution per category

plt.title('Spending by Category')

plt.xlabel('Product Category')

plt.ylabel('Purchase Amount (USD)')

plt.grid(True)

plt.show()

Output: [Figure: Spending by Category, box plot of Purchase Amount (USD) per Product Category]
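The numbers behind a box plot can also be read directly from a grouped summary. The sketch below computes the count, median, and interquartile range per group. The function name spend_summary and the demo values are made up for illustration.

```python
import pandas as pd

def spend_summary(df, group_col, value_col):
    """Return count, median, and interquartile range of value_col per group."""
    grouped = df.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        "count": grouped.count(),
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    summary["iqr"] = summary["q3"] - summary["q1"]
    # Highest typical spend first
    return summary.sort_values("median", ascending=False)

# Made-up demo data
demo = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Clothing", "Footwear", "Footwear", "Footwear"],
    "Purchase Amount (USD)": [20, 40, 60, 50, 70, 90],
})
summary = spend_summary(demo, "Category", "Purchase Amount (USD)")
print(summary)
```

On the project data, the equivalent call would be spend_summary(df, 'Category', 'Purchase Amount (USD)').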

Step 5: K-Means Clustering for Customer Segmentation

We’ll now apply K-Means Clustering to group customers based on their purchasing behavior. This helps identify distinct segments like high spenders, frequent buyers, or one-time shoppers.

What We’ll Do:

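Select relevant features: Purchase Amount (USD) and Frequency of Purchases
Convert the text frequency labels to numbers and scale both features
Use the Elbow Method to choose the best number of clusters
Fit K-Means and plot the inertia to detect the "elbow"

Here is the code:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 'Frequency of Purchases' holds text labels (e.g. 'Weekly'), so convert it to a number first.
# Assumption: approximate purchases per year for each label; adjust the map to your data.
# Labels missing from the map become NaN and must be handled before clustering.
freq_map = {'Weekly': 52, 'Bi-Weekly': 26, 'Fortnightly': 26, 'Monthly': 12,
            'Quarterly': 4, 'Every 3 Months': 4, 'Annually': 1}
df['Frequency Numeric'] = df['Frequency of Purchases'].map(freq_map)

# Feature selection: purchase amount and numeric frequency
X = df[['Purchase Amount (USD)', 'Frequency Numeric']]

# Scale the features, because KMeans is sensitive to feature magnitude
X_scaled = StandardScaler().fit_transform(X)

inertia = []

# Try clustering with K=1 to 10
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)  # Using the scaled version of X
    inertia.append(kmeans.inertia_)  # Save the sum of squared distances (inertia)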

# Plot the inertia values to find the 'elbow'

plt.plot(range(1, 11), inertia, marker='o')

plt.xlabel('Number of Clusters')

plt.ylabel('Inertia')

plt.title('Elbow Method For Optimal K')

plt.show()

Output: [Figure: Elbow Method For Optimal K, inertia vs. number of clusters]
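Reading the elbow off a plot is a judgment call. One common heuristic is to pick the k where the curve bends most sharply, measured by the second difference of the inertia values. The sketch below assumes evenly spaced k values; the function name elbow_k and the inertia numbers are made up. It is meant to complement the plot, not replace it.

```python
def elbow_k(ks, inertias):
    """Return the k with the largest second difference in inertia.

    A large second difference means the curve bends sharply at that k:
    inertia fell a lot going into k and much less going out of it.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertia values that flatten after k=3
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 600.0, 250.0, 220.0, 200.0, 190.0]
print(elbow_k(ks, inertias))  # 3
```

With the inertia list from the loop above, the call would be elbow_k(list(range(1, 11)), inertia).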

Visualize Customer Segments

Once we know the optimal number of clusters (say, 3 from the elbow method), we use K-Means to assign each customer to a segment.

What We’ll Do:

Apply K-Means clustering with 3 clusters
Label each customer with a cluster number
Visualize the segments in a scatter plot

Here is the code:

# Apply KMeans clustering with 3 clusters (based on elbow method)

kmeans = KMeans(n_clusters=3, random_state=42)

df['Cluster'] = kmeans.fit_predict(X_scaled)  # Assign cluster labels to each customer

# Plot the customer segments

plt.figure(figsize=(8, 6))

sns.scatterplot(x='Purchase Amount (USD)', 

                y='Frequency of Purchases', 

                hue='Cluster', 

                data=df, 

                palette='Set1')

plt.title('K-Means Customer Segments')

plt.xlabel('Purchase Amount (USD)')

plt.ylabel('Frequency of Purchases')

plt.grid(True)

plt.show()

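Output: [Figure: K-Means Customer Segments, Purchase Amount (USD) vs. Frequency of Purchases, colored by cluster]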
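When K-Means is fit on scaled features, its cluster centers are expressed in scaled units, which are hard to interpret. A minimal sketch of converting them back to the original units is shown below on synthetic data. The two groups and their values are made up, and the second feature is assumed to be a numeric purchases-per-year figure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two loose groups of (purchase amount, purchases per year)
rng = np.random.default_rng(0)
low = rng.normal(loc=[30, 4], scale=[5, 1], size=(50, 2))
high = rng.normal(loc=[90, 26], scale=[5, 2], size=(50, 2))
X = np.vstack([low, high])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Centers are learned in scaled space; convert them back to the original units
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, (amount, freq) in enumerate(centers):
    print(f"Cluster {i}: amount ~ {amount:.1f}, purchases/year ~ {freq:.1f}")
```

Describing each cluster by its center in dollars and purchases per year makes it easier to give segments names such as "high spenders" or "occasional buyers".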

Step 6: Identify High-Value Customers and Churn Risks

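In this step, we use RFM analysis and clustering to spot high-value customers and those at risk of churn, helping businesses focus on retention and personalized marketing. The dataset has no purchase dates or per-order history, so the code uses proxy columns for Recency and Frequency, as explained in the comments.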

Here is the code:

# RFM Analysis: Create Recency, Frequency, and Monetary features for customer segmentation

# We use 'Purchase Amount (USD)' as the Monetary value (total spend),

# 'Previous Purchases' as a proxy for Frequency (average purchase activity),

# and count of 'Review Rating' entries as a proxy for Recency (how recently they engaged).

rfm_data = df.groupby('Customer ID').agg({

    'Purchase Amount (USD)': 'sum',        # Monetary: Total money spent by each customer

    'Previous Purchases': 'mean',          # Frequency: Average past purchases (proxy)

    'Review Rating': 'count'               # Recency proxy: Count of reviews (assumes more reviews = more recent activity)

}).reset_index()

# Rename the columns to standard RFM names

rfm_data.columns = ['Customer_ID', 'Monetary', 'Frequency', 'Recency']

# Invert Recency: Higher counts mean more recent activity, 

# but in traditional RFM, *lower* recency values are better, so we invert it

rfm_data['Recency'] = rfm_data['Recency'].max() - rfm_data['Recency'] + 1

# Display sample and shape of the RFM dataset

print("RFM Data Sample:")

print(rfm_data.head())

print(f"\nRFM Data Shape: {rfm_data.shape}")

Output:

Customer_ID	Monetary	Frequency	Recency
1	53	14.0	1
2	64	2.0	1
3	73	23.0	1
4	90	49.0	1
5	49	31.0	1

RFM Data Shape: (3900, 4)
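The inversion max - value + 1 used above can be checked by hand. The sketch below applies it to made-up counts; invert is a hypothetical helper name used only here.

```python
def invert(values):
    """Flip a list so the largest input becomes 1 and the order is reversed."""
    top = max(values)
    return [top - v + 1 for v in values]

counts = [1, 3, 5]          # made-up engagement counts
print(invert(counts))       # [5, 3, 1]
print(invert([1, 1, 1]))    # [1, 1, 1]: identical inputs stay identical
```

The second call shows why every customer can end up with the same Recency value: if each customer has the same count, the inversion cannot separate them.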

Calculate RFM Scores for Customer Segmentation

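In this step, we assign each customer a score from 1 to 5 for Recency, Frequency, and Monetary using quantile-based binning (pd.qcut). If the quantile bins cannot be formed, for example because too many customers share the same value, the code falls back to equal-width bins (pd.cut). These scores help segment customers into distinct behavioral groups for targeted strategies.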

Here is the Code:


# Create RFM scores using quintiles with duplicate handling

# Use 'drop' to handle duplicate bin edges

try:

    # Method 1: Using qcut with duplicate handling

    rfm_data['R_Score'] = pd.qcut(rfm_data['Recency'], 5, labels=[5,4,3,2,1], duplicates='drop')

    rfm_data['F_Score'] = pd.qcut(rfm_data['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5], duplicates='drop')

    rfm_data['M_Score'] = pd.qcut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5], duplicates='drop')

except ValueError as e:

    print(f"qcut failed: {e}")

    print("Using cut method instead...")

    # Method 2: Alternative approach using cut with custom bins

    rfm_data['R_Score'] = pd.cut(rfm_data['Recency'], 5, labels=[5,4,3,2,1])

    rfm_data['F_Score'] = pd.cut(rfm_data['Frequency'], 5, labels=[1,2,3,4,5])

    rfm_data['M_Score'] = pd.cut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5])

# Convert scores to numeric (handle any NaN values)

rfm_data['R_Score'] = pd.to_numeric(rfm_data['R_Score'], errors='coerce')

rfm_data['F_Score'] = pd.to_numeric(rfm_data['F_Score'], errors='coerce')

rfm_data['M_Score'] = pd.to_numeric(rfm_data['M_Score'], errors='coerce')

# Fill any NaN values with the middle score (3)

rfm_data['R_Score'] = rfm_data['R_Score'].fillna(3)

rfm_data['F_Score'] = rfm_data['F_Score'].fillna(3)

rfm_data['M_Score'] = rfm_data['M_Score'].fillna(3)

# Create combined RFM score

rfm_data['RFM_Score'] = (rfm_data['R_Score'].astype(str) + 

                         rfm_data['F_Score'].astype(str) + 

                         rfm_data['M_Score'].astype(str))

print("RFM Scores Sample:")

print(rfm_data[['Customer_ID', 'R_Score', 'F_Score', 'M_Score', 'RFM_Score']].head())

Output:

Customer_ID	R_Score	F_Score	M_Score	RFM_Score
1	3	2	3	323
2	3	1	3	313
3	3	3	4	334
4	3	5	5	355
5	3	4	2	342
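Quantile binning struggles when many rows share the same value, because several bin edges coincide. Ranking the values first with method='first', as the F_Score line above does, gives every row a distinct rank so the five bins can always be formed. The sketch below shows the effect on a made-up series.

```python
import pandas as pd

# Made-up values with heavy ties: half of the customers share the value 1
values = pd.Series([1, 1, 1, 1, 1, 2, 3, 4, 5, 6])

# Ranking with method='first' gives every row a distinct rank (1..10),
# so the quintile edges are unique and each bin receives two rows.
ranks = values.rank(method="first")
scores = pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)

print(pd.DataFrame({"value": values, "score": scores}))
```

The trade-off is that tied customers can receive different scores depending on their row order, so this approach favors evenly sized bins over strict consistency between ties.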

Segment Customers Based on RFM Scores

Now that we have calculated the RFM scores, we can categorize customers into meaningful segments such as Champions, Loyal Customers, At Risk, etc. This helps in tailoring marketing strategies for different customer groups.

Here is the code:


# Define customer segments based on RFM scores

def segment_customers(row):

    """Function to assign customer segments based on RFM scores"""

    score = str(row['R_Score']) + str(row['F_Score']) + str(row['M_Score'])

    # High value customers

    if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:

        return 'Champions'

    elif row['R_Score'] >= 3 and row['F_Score'] >= 3 and row['M_Score'] >= 3:

        return 'Loyal Customers'

    elif row['R_Score'] >= 3 and row['F_Score'] <= 2:

        return 'New Customers'

    elif row['R_Score'] <= 2 and row['F_Score'] >= 3 and row['M_Score'] >= 3:

        return 'At Risk'

    elif row['R_Score'] <= 2 and row['M_Score'] >= 4:

        return 'Cannot Lose Them'

    elif row['R_Score'] <= 2 and row['F_Score'] <= 2:

        return 'Hibernating'

    else:

        return 'Others'

# Apply segmentation

rfm_data['Segment'] = rfm_data.apply(segment_customers, axis=1)

# Display segment distribution

segment_counts = rfm_data['Segment'].value_counts()

print("Customer Segments Distribution:")

print(segment_counts)

Output:

Segment	Count
New Customers	1561
Loyal Customers	1362
Others	977
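To see how the rules divide the score space, it can help to run them over every possible R, F, M combination. The sketch below repeats the rule order of segment_customers on plain integers (segment is a hypothetical stand-alone copy for illustration) and counts how many of the 125 combinations land in each segment.

```python
from collections import Counter
from itertools import product

def segment(r, f, m):
    """Same rule order as segment_customers above, on plain integers."""
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"
    if r >= 3 and f >= 3 and m >= 3:
        return "Loyal Customers"
    if r >= 3 and f <= 2:
        return "New Customers"
    if r <= 2 and f >= 3 and m >= 3:
        return "At Risk"
    if r <= 2 and m >= 4:
        return "Cannot Lose Them"
    if r <= 2 and f <= 2:
        return "Hibernating"
    return "Others"

# Count how many of the 5 x 5 x 5 score combinations fall into each segment
coverage = Counter(segment(r, f, m) for r, f, m in product(range(1, 6), repeat=3))
for name, n in coverage.most_common():
    print(f"{name}: {n}")
```

If every R_Score is 3, as in the sample rows shown earlier, only the rules that accept R = 3 can match: Loyal Customers, New Customers, and Others. That is consistent with the three segments in the output above.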

Step 7: Visualize Results

To gain intuitive insights from the customer segmentation, let’s visualize the RFM scores and clusters. These plots will help identify high-value customers, dormant segments, and emerging trends in customer behavior.

Here is the code:


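# Assumption: the cell that builds rfm_data_clean, optimal_k, and the 'Cluster'
# column is not shown in this article. The lines below are one way to recreate it:
# K-Means on the scaled R, F, M values with k=4, as described in the FAQ.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rfm_data_clean = rfm_data.dropna().copy()
optimal_k = 4
rfm_scaled = StandardScaler().fit_transform(rfm_data_clean[['Recency', 'Frequency', 'Monetary']])
rfm_data_clean['Cluster'] = KMeans(n_clusters=optimal_k, random_state=42, n_init=10).fit_predict(rfm_scaled)

# Create visualizations
plt.figure(figsize=(15, 10))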

# 1. RFM Score Distribution

plt.subplot(2, 3, 1)

plt.hist(rfm_data_clean['R_Score'].dropna(), bins=5, alpha=0.7, color='red', edgecolor='black')

plt.title('Recency Score Distribution')

plt.xlabel('Recency Score')

plt.ylabel('Frequency')

plt.subplot(2, 3, 2)

plt.hist(rfm_data_clean['F_Score'].dropna(), bins=5, alpha=0.7, color='green', edgecolor='black')

plt.title('Frequency Score Distribution')

plt.xlabel('Frequency Score')

plt.ylabel('Frequency')

plt.subplot(2, 3, 3)

plt.hist(rfm_data_clean['M_Score'].dropna(), bins=5, alpha=0.7, color='blue', edgecolor='black')

plt.title('Monetary Score Distribution')

plt.xlabel('Monetary Score')

plt.ylabel('Frequency')

# 2. Customer Segments Bar Plot

plt.subplot(2, 3, 4)

segment_counts.plot(kind='bar', color='skyblue', edgecolor='black')

plt.title('Customer Segments Distribution')

plt.xlabel('Segments')

plt.ylabel('Number of Customers')

plt.xticks(rotation=45)

# 3. Cluster scatter plot

plt.subplot(2, 3, 5)

colors = ['red', 'blue', 'green', 'orange']

for i in range(optimal_k):

    cluster_data = rfm_data_clean[rfm_data_clean['Cluster'] == i]

    plt.scatter(cluster_data['Frequency'], cluster_data['Monetary'], 

               c=colors[i], label=f'Cluster {i}', alpha=0.6)

plt.xlabel('Frequency')

plt.ylabel('Monetary')

plt.title('K-Means Clusters (Frequency vs Monetary)')

plt.legend()

# 4. Recency vs Monetary scatter, colored by cluster (a 2D view of the RFM space)

plt.subplot(2, 3, 6)

plt.scatter(rfm_data_clean['Recency'], rfm_data_clean['Monetary'], 

           c=rfm_data_clean['Cluster'], cmap='viridis', alpha=0.6)

plt.xlabel('Recency')

plt.ylabel('Monetary')

plt.title('Clusters: Recency vs Monetary')

plt.colorbar()

plt.tight_layout()

plt.show()

# Save results

rfm_data_clean.to_csv('rfm_analysis_results.csv', index=False)

print("\n RFM analysis completed successfully!")

print("Results saved to 'rfm_analysis_results.csv'")

print(f"Total customers analyzed: {len(rfm_data_clean)}")

Output:

RFM analysis completed successfully!

Total customers analyzed: 3900

Conclusion

In this project, we analyzed customer purchase data using RFM (Recency, Frequency, Monetary) analysis to assign customers to value segments. On this dataset, the rules produced three segments: New Customers, Loyal Customers, and Others. By combining EDA, scoring, and clustering, we obtained a structured view of customer behavior. This process helps businesses make evidence-based marketing and retention choices, with the aim of increasing customer lifetime value.

Colab Link:
https://colab.research.google.com/drive/1g2aPLZeAX4R13zaac0ANr-X88D-Sya5e?usp=sharing

Rohit Sharma

779 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
