Customer Purchase Behavior Analysis Project Using Python

By Rohit Sharma

Updated on Jul 24, 2025 | 11 min read | 1.36K+ views

Customer Purchase Behavior Analysis helps businesses get a clear picture of how their customers shop. By breaking down the data, it becomes easier to spot loyal or big-spending customers, find patterns in what people often buy together, and understand what types of shoppers are most valuable to the business. 

In this project, we use Python to perform RFM analysis and K‑Means clustering on an e-commerce store's data to learn these concepts.

What Should You Know Beforehand?

It’s helpful to have basic familiarity with Python, Pandas, and the plotting and clustering libraries listed in the next section before starting this project.

Technologies and Libraries Used

For this Customer Purchase Behavior Analysis project, the following tools and libraries will be used:

Tool / Library | Purpose
Python | Programming language
Google Colab | Cloud-based notebook for writing and running code
Pandas | Data loading, cleaning, manipulation, and RFM computations
NumPy | Efficient numerical operations
Matplotlib / Seaborn | Data visualization
Scikit‑learn | K‑Means clustering and feature scaling

Models That Will Be Utilized for Learning

For our Customer Purchase Behavior Analysis, we’ll leverage two accessible yet powerful techniques:

  • RFM Analysis
    Segments customers based on Recency (how recently they purchased), Frequency (how often), and Monetary (how much they spent), helping identify high‑value and at‑risk groups.
  • K‑Means Clustering
    An unsupervised algorithm that groups customers into clusters with similar RFM profiles, enabling targeted marketing and personalized campaigns.
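To make the three RFM measures concrete, here is a minimal sketch on a made-up transaction log. The column names (`customer_id`, `order_date`, `amount`), the sample rows, and the snapshot date are all assumptions for illustration; they are not taken from this project's dataset, which has no purchase dates.

```python
import pandas as pd

# Hypothetical transaction log: one row per order (illustrative values only)
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
    "amount": [50.0, 30.0, 120.0, 20.0, 25.0, 40.0],
})

# Reference date from which recency is measured (assumed)
snapshot = pd.Timestamp("2024-03-31")

rfm = orders.groupby("customer_id").agg(
    last_order=("order_date", "max"),   # most recent purchase date
    Frequency=("order_date", "count"),  # number of orders
    Monetary=("amount", "sum"),         # total spend
)

# Recency: days between the snapshot date and the last purchase
rfm["Recency"] = (snapshot - rfm["last_order"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]
print(rfm)
```

A lower Recency means a more recent purchase, which is why Recency is usually scored in the opposite direction from Frequency and Monetary.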

Time Taken and Difficulty

You can complete this Customer Purchase Behavior Analysis project in about 2 to 3 hours. It’s ideal for beginners to intermediate users, offering hands‑on experience with customer segmentation and clustering using Python.

How to Build a Customer Purchase Behavior Analysis Model

Let’s build this project from scratch with clear, step-by-step guidance:

  1. Downloading the Dataset
  2. Uploading and Reading the Data in Google Colab
  3. Cleaning and Preparing the Data
  4. Exploratory Data Analysis (EDA)
  5. Applying K‑Means Clustering
  6. Identifying High-Value Customers and Churn Risks with RFM
  7. Visualizing Results

With these steps, you’ll uncover actionable insights into customer behavior and build a robust analysis workflow using Python.

Without any further delay, let’s get started!

Step 1: Download the Dataset

To analyze customer purchase behavior, we’ll use a consumer shopping behavior dataset hosted on Kaggle.

Follow the steps below to download the dataset:

  1. Open a new tab in your web browser.
  2. Go to: https://www.kaggle.com/code/xokent/consumer-behavior-and-shopping-habits-clustering/input
  3. Click the Download button to download the dataset as a .zip file.
  4. Once downloaded, extract the ZIP file. You’ll find a file named shopping_behavior_updated.csv.
  5. We’ll use this CSV file for the project.

Now that you’ve downloaded the dataset, let’s move on to the next step, uploading and loading it into Google Colab.

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data:

# Step 2: Read the dataset into a pandas DataFrame

import pandas as pd

df = pd.read_csv('shopping_behavior_updated.csv')

df.head()  # Shows the first 5 rows of the data

Output:

(index) | Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Payment Method | Frequency of Purchases
0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly
1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly
2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly
3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly
4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually

Step 3: Clean and Prepare the Data

To ensure clean and reliable analysis, we remove any rows with missing data:

Here is the code:


df.isnull().sum()  # Check for any missing values in the data

# Remove rows that contain any missing (NaN) values

df = df.dropna()
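
As a small illustration of what the check and the drop do, the sketch below runs on a made-up three-row frame. The column names are borrowed from the dataset, but the values (and the missing entries) are invented. It also shows the `subset=` argument of `dropna`, which is one way to keep rows that are only missing values in columns the analysis does not use.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing purchase amount and one missing category (invented values)
toy = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "Purchase Amount (USD)": [53.0, np.nan, 73.0],
    "Category": ["Clothing", "Clothing", None],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() with no arguments removes every row that has any missing value
all_clean = toy.dropna()

# subset= restricts the check to the columns the analysis actually needs
amount_clean = toy.dropna(subset=["Purchase Amount (USD)"])

print(len(toy), len(all_clean), len(amount_clean))
```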

Output:

Column | Missing values
Customer ID | 0
Age | 0
Gender | 0
Item Purchased | 0
Category | 0
Purchase Amount (USD) | 0
Location | 0
Size | 0
Color | 0
Season | 0
Review Rating | 0
Subscription Status | 0
Shipping Type | 0
Discount Applied | 0
Promo Code Used | 0
Previous Purchases | 0
Payment Method | 0
Frequency of Purchases | 0

Step 4: Exploratory Data Analysis (EDA)

Understanding the distribution of customer spending is crucial before applying any clustering or segmentation techniques. Let’s begin by visualizing the Purchase Amount (USD).

Here is the code:

import matplotlib.pyplot as plt
import seaborn as sns

# Set the figure size for better visibility
plt.figure(figsize=(8, 4))

# Plot a histogram of the 'Purchase Amount (USD)' column with KDE

# This helps us see how the data is distributed

sns.histplot(df['Purchase Amount (USD)'], bins=30, kde=True)

# Add a title and labels for clarity

plt.title('Purchase Amount Distribution')

plt.xlabel('Purchase Amount (USD)')

plt.ylabel('Frequency')

# Show the plot

plt.show()

Output: 

Analyze Spending by Category

To understand how spending differs across product categories, we’ll use a box plot. This helps identify:

  • Which categories have the highest median spend
  • Categories with outliers or varied spending

 What We’ll Do:

  • Compare purchase amounts across different product categories using a box plot

Here is the code:

 

# Create a box plot to compare spending across product categories

plt.figure(figsize=(10, 5))

sns.boxplot(x='Category', y='Purchase Amount (USD)', data=df)  # Shows distribution per category

plt.title('Spending by Category')

plt.xlabel('Product Category')

plt.ylabel('Purchase Amount (USD)')

plt.grid(True)

plt.show()
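
A box plot draws the median and quartiles rather than the mean, so it can help to print the numbers behind it. The following is a minimal sketch on an invented five-row frame; the column names match the dataset, the values do not come from it.

```python
import pandas as pd

# Invented sample: three clothing purchases and two footwear purchases
toy = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Footwear", "Footwear", "Clothing"],
    "Purchase Amount (USD)": [53, 64, 90, 30, 49],
})

# Per-category summary, sorted so the category with the highest median comes first
summary = (
    toy.groupby("Category")["Purchase Amount (USD)"]
       .agg(["count", "mean", "median", "min", "max"])
       .sort_values("median", ascending=False)
)
print(summary)
```

Running the same `groupby` on `df` instead of `toy` would give the table for the real data.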

Output: 

Step 5: K-Means Clustering for Customer Segmentation

We’ll now apply K-Means Clustering to group customers based on their purchasing behavior. This helps identify distinct segments like high spenders, frequent buyers, or one-time shoppers.

What We’ll Do:

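  • Select relevant features: Purchase Amount (USD) and Frequency of Purchases
  • Convert the frequency labels to numbers and scale both features, because K-Means is sensitive to feature magnitude
  • Fit K-Means for K = 1 to 10 and record the inertia
  • Plot the inertia to detect the "elbow" and choose the number of clusters

Here is the code:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 'Frequency of Purchases' holds text labels such as 'Weekly', so convert it to
# approximate purchases per year. This mapping is an assumption; extend it so it
# covers every label in df['Frequency of Purchases'].unique().
freq_per_year = {'Weekly': 52, 'Fortnightly': 26, 'Bi-Weekly': 26, 'Monthly': 12,
                 'Quarterly': 4, 'Every 3 Months': 4, 'Annually': 1}
df['Purchases Per Year'] = df['Frequency of Purchases'].map(freq_per_year)

# Feature selection: purchase amount and numeric frequency
X = df[['Purchase Amount (USD)', 'Purchases Per Year']]

# Scale the features, because KMeans is sensitive to feature magnitude
X_scaled = StandardScaler().fit_transform(X)

inertia = []

# Try clustering with K=1 to 10
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)  # Using the scaled version of X
    inertia.append(kmeans.inertia_)  # Save the sum of squared distances (inertia)

# Plot the inertia values to find the 'elbow'
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()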

Output:
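
The elbow in an inertia plot can be hard to read by eye. One common complement is the silhouette score, which is higher when clusters are compact and well separated. The sketch below is not part of the original notebook: it uses synthetic points (three artificial blobs) as a stand-in for the scaled feature matrix, and `X_demo`, `scores`, and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features: three well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_demo = StandardScaler().fit_transform(points)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real customer data the scores are usually much closer together than on these artificial blobs, so it is safer to treat the result as one input alongside the elbow plot rather than a final answer.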

Visualize Customer Segments

Once we know the optimal number of clusters (say, 3 from the elbow method), we use K-Means to assign each customer to a segment.

What We’ll Do:

  • Apply K-Means clustering with 3 clusters
  • Label each customer with a cluster number
  • Visualize the segments in a scatter plot

Here is the code:

# Apply KMeans clustering with 3 clusters (based on elbow method)

kmeans = KMeans(n_clusters=3, random_state=42)

df['Cluster'] = kmeans.fit_predict(X_scaled)  # Assign cluster labels to each customer

# Plot the customer segments

plt.figure(figsize=(8, 6))

sns.scatterplot(x='Purchase Amount (USD)', 

                y='Frequency of Purchases', 

                hue='Cluster', 

                data=df, 

                palette='Set1')

plt.title('K-Means Customer Segments')

plt.xlabel('Purchase Amount (USD)')

plt.ylabel('Frequency of Purchases')

plt.grid(True)

plt.show()

Output:
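
Cluster numbers on their own say little about who is in each segment. One way to interpret them, sketched here under assumptions, is to map the centroids back to the original units and count the members of each cluster. The data, the column names `spend` and `purchases_per_year`, and the variable names are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented data: spend in USD and purchases per year for eight customers
toy = pd.DataFrame({
    "spend": [20, 25, 22, 90, 95, 88, 55, 60],
    "purchases_per_year": [1, 2, 1, 50, 52, 48, 12, 12],
})
features = toy[["spend", "purchases_per_year"]].to_numpy(dtype=float)

scaler = StandardScaler()
X_toy = scaler.fit_transform(features)

km = KMeans(n_clusters=3, random_state=42, n_init=10)
toy["Cluster"] = km.fit_predict(X_toy)

# Centroids live in scaled space; inverse_transform maps them back to USD / purchases
centers = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=["spend", "purchases_per_year"],
)
sizes = toy["Cluster"].value_counts().sort_index()

print(centers.round(1))
print(sizes)
```

Reading the centroid table row by row gives a plain-language description of each segment, for example "low spend, rarely buys" versus "high spend, buys weekly".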

Step 6: Identify High-Value Customers and Churn Risks

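In this step, we use RFM analysis and clustering to spot high-value customers and those at risk of churn, helping businesses focus on retention and personalized marketing. The dataset has one row per customer and no purchase dates, so the code below relies on proxies: Purchase Amount (USD) for Monetary, Previous Purchases for Frequency, and a count of Review Rating entries for Recency. With a single row per customer, that Recency proxy is 1 for everyone, so it cannot separate recent from lapsed customers.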

Here is the code:

# RFM Analysis: Create Recency, Frequency, and Monetary features for customer segmentation

# We use 'Purchase Amount (USD)' as the Monetary value (total spend),

# 'Previous Purchases' as a proxy for Frequency (average purchase activity),

# and count of 'Review Rating' entries as a proxy for Recency (how recently they engaged).

rfm_data = df.groupby('Customer ID').agg({

    'Purchase Amount (USD)': 'sum',        # Monetary: Total money spent by each customer

    'Previous Purchases': 'mean',          # Frequency: Average past purchases (proxy)

    'Review Rating': 'count'               # Recency proxy: Count of reviews (assumes more reviews = more recent activity)

}).reset_index()

# Rename the columns to standard RFM names

rfm_data.columns = ['Customer_ID', 'Monetary', 'Frequency', 'Recency']

# Invert Recency: Higher counts mean more recent activity, 

# but in traditional RFM, *lower* recency values are better, so we invert it

rfm_data['Recency'] = rfm_data['Recency'].max() - rfm_data['Recency'] + 1

# Display sample and shape of the RFM dataset

print("RFM Data Sample:")

print(rfm_data.head())

print(f"\nRFM Data Shape: {rfm_data.shape}")

Output:

Customer_ID | Monetary | Frequency | Recency
1 | 53 | 14.0 | 1
2 | 64 | 2.0 | 1
3 | 73 | 23.0 | 1
4 | 90 | 49.0 | 1
5 | 49 | 31.0 | 1

RFM Data Shape: (3900, 4)

Calculate RFM Scores for Customer Segmentation

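In this step, we assign each customer a score from 1 to 5 on Recency, Frequency, and Monetary. The code first tries quantile-based binning (pd.qcut); if that raises a ValueError, for example because too many identical values leave fewer distinct bin edges than labels, it falls back to equal-width binning (pd.cut). These scores help segment customers into distinct behavioral groups for targeted strategies.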

Here is the Code:


# Create RFM scores using quintiles with duplicate handling

# Use 'drop' to handle duplicate bin edges

try:

    # Method 1: Using qcut with duplicate handling

    rfm_data['R_Score'] = pd.qcut(rfm_data['Recency'], 5, labels=[5,4,3,2,1], duplicates='drop')

    rfm_data['F_Score'] = pd.qcut(rfm_data['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5], duplicates='drop')

    rfm_data['M_Score'] = pd.qcut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5], duplicates='drop')

except ValueError as e:

    print(f"qcut failed: {e}")

    print("Using cut method instead...")

    # Method 2: Alternative approach using cut with custom bins

    rfm_data['R_Score'] = pd.cut(rfm_data['Recency'], 5, labels=[5,4,3,2,1])

    rfm_data['F_Score'] = pd.cut(rfm_data['Frequency'], 5, labels=[1,2,3,4,5])

    rfm_data['M_Score'] = pd.cut(rfm_data['Monetary'], 5, labels=[1,2,3,4,5])

# Convert scores to numeric (handle any NaN values)

rfm_data['R_Score'] = pd.to_numeric(rfm_data['R_Score'], errors='coerce')

rfm_data['F_Score'] = pd.to_numeric(rfm_data['F_Score'], errors='coerce')

rfm_data['M_Score'] = pd.to_numeric(rfm_data['M_Score'], errors='coerce')

# Fill any NaN values with median score

rfm_data['R_Score'] = rfm_data['R_Score'].fillna(3)

rfm_data['F_Score'] = rfm_data['F_Score'].fillna(3)

rfm_data['M_Score'] = rfm_data['M_Score'].fillna(3)

# Create combined RFM score

rfm_data['RFM_Score'] = (rfm_data['R_Score'].astype(str) + 

                         rfm_data['F_Score'].astype(str) + 

                         rfm_data['M_Score'].astype(str))

print("RFM Scores Sample:")

print(rfm_data[['Customer_ID', 'R_Score', 'F_Score', 'M_Score', 'RFM_Score']].head())

Output:

Customer_ID | R_Score | F_Score | M_Score | RFM_Score
1 | 3 | 2 | 3 | 323
2 | 3 | 1 | 3 | 313
3 | 3 | 3 | 4 | 334
4 | 3 | 5 | 5 | 355
5 | 3 | 4 | 2 | 342

Segment Customers Based on RFM Scores

Now that we have calculated the RFM scores, we can categorize customers into meaningful segments such as Champions, Loyal Customers, At Risk, etc. This helps in tailoring marketing strategies for different customer groups.

Here is the code: 


# Define customer segments based on RFM scores

def segment_customers(row):

    """Function to assign customer segments based on RFM scores"""

    score = str(row['R_Score']) + str(row['F_Score']) + str(row['M_Score'])

    # High value customers

    if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:

        return 'Champions'

    elif row['R_Score'] >= 3 and row['F_Score'] >= 3 and row['M_Score'] >= 3:

        return 'Loyal Customers'

    elif row['R_Score'] >= 3 and row['F_Score'] <= 2:

        return 'New Customers'

    elif row['R_Score'] <= 2 and row['F_Score'] >= 3 and row['M_Score'] >= 3:

        return 'At Risk'

    elif row['R_Score'] <= 2 and row['M_Score'] >= 4:

        return 'Cannot Lose Them'

    elif row['R_Score'] <= 2 and row['F_Score'] <= 2:

        return 'Hibernating'

    else:

        return 'Others'

# Apply segmentation

rfm_data['Segment'] = rfm_data.apply(segment_customers, axis=1)

# Display segment distribution

segment_counts = rfm_data['Segment'].value_counts()

print("Customer Segments Distribution:")

print(segment_counts)

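Output:

Segment | Count
New Customers | 1561
Loyal Customers | 1362
Others | 977

Every customer has an R_Score of 3 here, so the segments that require R_Score >= 4 (Champions) or R_Score <= 2 (At Risk, Cannot Lose Them, Hibernating) stay empty.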
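
The same rules can also be written without a row-by-row `apply`. The following is a minimal sketch using `numpy.select`, which checks the conditions in order and takes the first match, just like an if/elif chain. The five rows of scores are invented so that several segments appear.

```python
import numpy as np
import pandas as pd

# Invented scores for five customers
scores = pd.DataFrame({
    "R_Score": [5, 3, 3, 2, 1],
    "F_Score": [5, 3, 1, 4, 1],
    "M_Score": [4, 3, 5, 3, 2],
})
r, f, m = scores["R_Score"], scores["F_Score"], scores["M_Score"]

# Conditions are evaluated in order; the first one that is True decides the label
conditions = [
    (r >= 4) & (f >= 4) & (m >= 4),
    (r >= 3) & (f >= 3) & (m >= 3),
    (r >= 3) & (f <= 2),
    (r <= 2) & (f >= 3) & (m >= 3),
    (r <= 2) & (m >= 4),
    (r <= 2) & (f <= 2),
]
labels = ["Champions", "Loyal Customers", "New Customers",
          "At Risk", "Cannot Lose Them", "Hibernating"]

# Rows matching no condition fall back to 'Others'
scores["Segment"] = np.select(conditions, labels, default="Others")
print(scores)
```

On a few thousand rows the speed difference is negligible, so this is mainly a matter of style; the vectorized form tends to pay off on much larger tables.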

Step 7: Visualize Results

To gain intuitive insights from the customer segmentation, let’s visualize the RFM scores and clusters. These plots will help identify high-value customers, dormant segments, and emerging trends in customer behavior.

Here is the code:


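# Create visualizations
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The original snippet does not show how rfm_data_clean, optimal_k, or the
# 'Cluster' column were created; the lines below are one assumed way to define them.
rfm_data_clean = rfm_data.dropna().copy()
optimal_k = 3  # assumed; must not exceed the 4 entries in the colors list below
rfm_scaled = StandardScaler().fit_transform(
    rfm_data_clean[['Recency', 'Frequency', 'Monetary']])
rfm_data_clean['Cluster'] = KMeans(
    n_clusters=optimal_k, random_state=42, n_init=10).fit_predict(rfm_scaled)

plt.figure(figsize=(15, 10))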

# 1. RFM Score Distribution

plt.subplot(2, 3, 1)

plt.hist(rfm_data_clean['R_Score'].dropna(), bins=5, alpha=0.7, color='red', edgecolor='black')

plt.title('Recency Score Distribution')

plt.xlabel('Recency Score')

plt.ylabel('Frequency')

plt.subplot(2, 3, 2)

plt.hist(rfm_data_clean['F_Score'].dropna(), bins=5, alpha=0.7, color='green', edgecolor='black')

plt.title('Frequency Score Distribution')

plt.xlabel('Frequency Score')

plt.ylabel('Frequency')

plt.subplot(2, 3, 3)

plt.hist(rfm_data_clean['M_Score'].dropna(), bins=5, alpha=0.7, color='blue', edgecolor='black')

plt.title('Monetary Score Distribution')

plt.xlabel('Monetary Score')

plt.ylabel('Frequency')

# 2. Customer Segments Bar Plot

plt.subplot(2, 3, 4)

segment_counts.plot(kind='bar', color='skyblue', edgecolor='black')

plt.title('Customer Segments Distribution')

plt.xlabel('Segments')

plt.ylabel('Number of Customers')

plt.xticks(rotation=45)

# 3. Cluster scatter plot

plt.subplot(2, 3, 5)

colors = ['red', 'blue', 'green', 'orange']

for i in range(optimal_k):

    cluster_data = rfm_data_clean[rfm_data_clean['Cluster'] == i]

    plt.scatter(cluster_data['Frequency'], cluster_data['Monetary'], 

               c=colors[i], label=f'Cluster {i}', alpha=0.6)

plt.xlabel('Frequency')

plt.ylabel('Monetary')

plt.title('K-Means Clusters (Frequency vs Monetary)')

plt.legend()

# 4. RFM 3D scatter (using 2D projection)

plt.subplot(2, 3, 6)

plt.scatter(rfm_data_clean['Recency'], rfm_data_clean['Monetary'], 

           c=rfm_data_clean['Cluster'], cmap='viridis', alpha=0.6)

plt.xlabel('Recency')

plt.ylabel('Monetary')

plt.title('Clusters: Recency vs Monetary')

plt.colorbar()

plt.tight_layout()

plt.show()

# Save results

rfm_data_clean.to_csv('rfm_analysis_results.csv', index=False)

print("\n RFM analysis completed successfully!")

print("Results saved to 'rfm_analysis_results.csv'")

print(f"Total customers analyzed: {len(rfm_data_clean)}")

Output:

RFM analysis completed successfully!
Results saved to 'rfm_analysis_results.csv'
Total customers analyzed: 3900

Conclusion

In this project, we analyzed customer purchase data with RFM (Recency, Frequency, Monetary) analysis and K-Means clustering. The scoring rules define value segments such as Champions, Loyal Customers, and At Risk; on this dataset, customers fell into New Customers, Loyal Customers, and Others. Combining EDA, scoring, and clustering gives businesses an evidence-based starting point for marketing and retention decisions aimed at increasing customer lifetime value.

Colab Link:
https://colab.research.google.com/drive/1g2aPLZeAX4R13zaac0ANr-X88D-Sya5e?usp=sharing

Frequently Asked Questions (FAQs)

1. What is RFM analysis, and why is it important?

2. How were Recency, Frequency, and Monetary values calculated in this project?

3. Why did we invert the Recency values?

4. How are RFM scores assigned to customers?

5. What customer segments were identified using RFM scores?

6. What is the benefit of customer segmentation?

7. How was K-means clustering used in this project?

8. What visualizations were created to analyze customer data?

9. How is this analysis useful for marketing teams?

10. Can this RFM model be improved or extended?

11. Where can I apply this RFM + Clustering approach?

Rohit Sharma

779 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
