Customer Segmentation Project Using R: A Step-by-Step Guide
By Rohit Sharma
Updated on Jul 25, 2025 | 11 min read | 1.53K+ views
Businesses use customer segmentation to group customers based on similar behavior and traits. In this customer segmentation project using R, we'll perform customer segmentation using K-means clustering.
We'll work with R libraries such as dplyr, ggplot2, factoextra, and the built-in stats package to clean, analyze, and visualize customer patterns in the data.
This project will help you understand how to divide your customer base into relevant segments, which can further be used for targeted marketing, improved service, and profitable business decisions.
Before starting this project, you should be familiar with a few core concepts: basic R syntax, working with data frames, data cleaning and manipulation, and the idea behind K-means clustering.
To make sure your project runs smoothly, you'll need a set of tools and libraries to run the code. Those required for this customer segmentation project using R are listed in the table below:
| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Programming Language | R | Main language used for analysis and modeling |
| Platform | Google Colab / RStudio | Environment to write and run R code |
| Data Format | CSV | Input data file format |
| Library | dplyr | Data manipulation and cleaning |
| Library | ggplot2 | Data visualization |
| Library | stats (built-in) | Performing K-means clustering |
| Library | factoextra | Visualizing clustering results and the elbow plot |
| Library | readr | Reading CSV files |
| Optional Library | scales | Enhancing plot labels and formatting |
To begin, you need a dataset that contains customer-related information such as age, income, gender, and spending behavior. Websites like Kaggle offer free datasets you can download.
Setting Up Google Colab for R
Google Colab uses Python by default, so you need to switch the runtime to R: open Runtime > Change runtime type and select R as the runtime language.
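With the runtime switched, a quick check in the first cell confirms the notebook is actually running R (a minimal sketch; the exact version string will differ on your machine):

```r
# Print the version of the active R interpreter.
# If this errors or shows Python output, the runtime was not switched.
print(R.version.string)
```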
Before analyzing the data, we need to install the necessary R libraries. These libraries will help with functions like data manipulation, visualization, clustering, and file handling. The code for this step is given in the code block below:
# Installing essential libraries
install.packages("dplyr") # For data manipulation
install.packages("ggplot2") # For creating visualizations
install.packages("factoextra") # For visualizing clustering results
install.packages("readr") # For reading CSV files into R
In this step, we will load the libraries we installed and then read the dataset into R. This will help us start working with the customer data. The code for this section is given in the code block below:
# Load required libraries
library(dplyr) # For data manipulation
library(ggplot2) # For plotting graphs
library(factoextra) # For visualizing clusters
library(readr) # For reading CSV files
# Read the uploaded CSV file into a data frame
data <- read_csv("Customer Data.csv")
# Display the first few rows of the dataset
head(data)
Running this code displays the first few rows of the dataset:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| <chr> | <chr> | <chr> | <dbl> | <chr> | <dbl> | <dbl> | <chr> |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850 | United Kingdom |
This step will help us understand the structure of the dataset. We will get basic statistics for each column and check for any missing values. The code for this section is given in the code block below:
# Check the structure of the dataset: column names and data types
str(data)
# Get summary statistics like mean, min, max for each column
summary(data)
# Check for missing values in each column
colSums(is.na(data))
The above code returns the output:
| Column | Missing Values |
| --- | --- |
| InvoiceNo | 0 |
| StockCode | 0 |
| Description | 1454 |
| Quantity | 0 |
| InvoiceDate | 0 |
| UnitPrice | 0 |
| CustomerID | 135080 |
| Country | 0 |
This means that the Description column has 1,454 missing values and CustomerID has 135,080. Rows without a CustomerID cannot be tied to any customer, so we will drop them before aggregating.
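The missing-value check and removal can be rehearsed on a toy data frame; this is a base-R sketch with made-up values, not the project dataset:

```r
# Toy transactions: two of four rows lack a CustomerID
toy <- data.frame(
  InvoiceNo  = c("536365", "536366", "536367", "536368"),
  CustomerID = c(17850, NA, 17851, NA)
)

# Count missing values per column, as in the project code
colSums(is.na(toy))

# Drop rows with no CustomerID before any per-customer aggregation
toy_clean <- toy[!is.na(toy$CustomerID), ]
nrow(toy_clean)   # 2
```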
In this step, we will transform raw transaction data into aggregated customer-level data. The code for this section is given in the code block below:
library(dplyr)
# Create a new column for total price of each transaction
data <- data %>%
mutate(TotalPrice = Quantity * UnitPrice)
# Remove rows where CustomerID is missing
data <- data %>%
filter(!is.na(CustomerID))
# Group data by CustomerID and calculate key features:
# - Total quantity purchased
# - Total amount spent
# - Average unit price per item
# - Number of transactions made
customer_data <- data %>%
group_by(CustomerID) %>%
summarise(
TotalQuantity = sum(Quantity, na.rm = TRUE),
TotalSpent = sum(TotalPrice, na.rm = TRUE),
AvgUnitPrice = mean(UnitPrice, na.rm = TRUE),
NumTransactions = n()
)
# View the first few rows of the summarized customer data
head(customer_data)
The output of this section:
| CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions |
| --- | --- | --- | --- | --- |
| <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| 12346 | 0 | 0.00 | 1.040000 | 2 |
| 12347 | 2458 | 4310.00 | 2.644011 | 182 |
| 12348 | 2341 | 1797.24 | 5.764839 | 31 |
| 12349 | 631 | 1757.55 | 8.289041 | 73 |
| 12350 | 197 | 334.40 | 3.841176 | 17 |
| 12352 | 470 | 1545.41 | 23.274737 | 95 |
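The same group-and-summarise pattern can be reproduced in base R with aggregate(); the toy invoices below are made up for illustration and are not the project dataset:

```r
# Toy line items for two customers
tx <- data.frame(
  CustomerID = c(12347, 12347, 12348),
  Quantity   = c(2, 3, 10),
  UnitPrice  = c(1.5, 2.0, 0.5)
)
tx$TotalPrice <- tx$Quantity * tx$UnitPrice

# One row per customer: total quantity bought and total amount spent
per_customer <- aggregate(
  cbind(TotalQuantity = Quantity, TotalSpent = TotalPrice) ~ CustomerID,
  data = tx, FUN = sum
)
print(per_customer)
# Customer 12347: TotalQuantity 5, TotalSpent 9; customer 12348: 10 and 5
```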
Before applying clustering algorithms, it's important to remove non-numeric identifiers and normalize the features. The code for this step is given in the code block below.
# Remove CustomerID as it's just an identifier and not useful for clustering
customer_data_numeric <- customer_data %>%
select(-CustomerID)
# Normalize the numeric features so they have mean = 0 and standard deviation = 1
normalized_data <- scale(customer_data_numeric)
# View the first few rows of the normalized data
head(normalized_data)
| TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions |
| --- | --- | --- | --- |
| -0.2401871 | -0.23097457 | -0.047864577 | -0.391674900 |
| 0.2858369 | 0.29339811 | -0.036799633 | 0.382613202 |
| 0.2607983 | -0.01231481 | -0.015271237 | -0.266928483 |
| -0.1051500 | -0.01714367 | 0.002141461 | -0.086261260 |
| -0.1980281 | -0.19029006 | -0.028541230 | -0.327150891 |
| -0.1396048 | -0.04295351 | 0.105517240 | 0.008373953 |
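You can verify what scale() did: after scaling, every column has mean 0 and standard deviation 1. A base-R check on a toy matrix (not the customer data):

```r
# Two toy columns on very different scales
m <- matrix(c(10, 20, 30, 40,
              0.1, 0.2, 0.3, 0.4), ncol = 2)
z <- scale(m)

# Means are (numerically) zero and standard deviations are one
round(colMeans(z), 10)   # 0 0
apply(z, 2, sd)          # 1 1
```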
Before applying K-means, we need to choose the right number of clusters (k). The Elbow Method helps identify the optimal value by plotting the within-cluster sum of squares (WSS) for different values of k. The code for this section is given in the code block below:
# Install and load factoextra if not already done
install.packages("factoextra")
library(factoextra)
# Use the Elbow Method to visualize how WSS changes with different k values
fviz_nbclust(normalized_data, kmeans, method = "wss") +
labs(title = "Elbow Method for Choosing K")
The output is an elbow plot of WSS against k. The curve bends noticeably around k = 4, so we will use four clusters in the next step.
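Under the hood, the elbow curve is just the total within-cluster sum of squares reported by kmeans() at each k. Here is a hand-rolled version using only the built-in stats package on synthetic two-blob data (the numbers are illustrative, not from the project dataset):

```r
set.seed(42)
# Two well-separated synthetic clusters in 2D
blobs <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Total WSS for k = 1..10; the drop flattens after the true k (here 2)
wss <- sapply(1:10, function(k) {
  kmeans(blobs, centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```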
After determining the optimal number of clusters (k = 4), we apply the K-means algorithm to group customers into segments. This step will assign a cluster label to each customer based on their behavior. The code for this section is given in the code block below:
# Apply K-means clustering with 4 clusters
set.seed(123) # Ensures consistent results every time you run the code
kmeans_result <- kmeans(normalized_data, centers = 4, nstart = 25)
# Add the cluster label (1 to 4) back to the original customer data
customer_data$Cluster <- as.factor(kmeans_result$cluster)
# View the first few rows of the customer data with cluster assignments
head(customer_data)
The output of the above code is:
| CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions | Cluster |
| --- | --- | --- | --- | --- | --- |
| <dbl> | <dbl> | <dbl> | <dbl> | <int> | <fct> |
| 12346 | 0 | 0.00 | 1.040000 | 2 | 2 |
| 12347 | 2458 | 4310.00 | 2.644011 | 182 | 2 |
| 12348 | 2341 | 1797.24 | 5.764839 | 31 | 2 |
| 12349 | 631 | 1757.55 | 8.289041 | 73 | 2 |
| 12350 | 197 | 334.40 | 3.841176 | 17 | 2 |
| 12352 | 470 | 1545.41 | 23.274737 | 95 | 2 |
The output now includes a new column called Cluster. This column represents the customer segment (from 1 to 4) assigned by the K-means algorithm.
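A quick way to see how many customers fell into each segment is table() on the cluster column. In the project this would be table(customer_data$Cluster); the sketch below runs it on a small synthetic clustering instead:

```r
set.seed(123)
pts <- matrix(rnorm(60), ncol = 2)          # 30 synthetic "customers"
km  <- kmeans(pts, centers = 3, nstart = 25)

# Number of customers assigned to each of the 3 clusters
table(as.factor(km$cluster))
```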
In this step, we will generate a 2D scatter plot to visually explore how customers are grouped into different clusters. The code for this section is given in the code block below:
# Visualize the K-means clustering result
fviz_cluster(kmeans_result,
data = normalized_data,
geom = "point", # Show each customer as a point
ellipse.type = "norm", # Add normal-shaped boundary around clusters
palette = "jco", # Use a clean color palette
ggtheme = theme_minimal()) + # Minimalist theme for clarity
labs(title = "Customer Segmentation using K-Means") # Add plot title
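Note that fviz_cluster plots the clusters on the first two principal components, since our data has four features. If you would rather see the clusters in original units, a plain base-R scatter of any two features colored by cluster also works; the data below is synthetic for illustration:

```r
set.seed(123)
# Synthetic stand-ins for two of the project's features
toy <- cbind(TotalSpent      = c(rnorm(20, 100, 10), rnorm(20, 500, 10)),
             NumTransactions = c(rnorm(20, 5, 1),    rnorm(20, 50, 1)))
km <- kmeans(scale(toy), centers = 2, nstart = 25)

# In the project: replace `toy` with the matching customer_data columns
plot(toy, col = km$cluster, pch = 19,
     xlab = "Total Spent", ylab = "Number of Transactions",
     main = "Clusters on two original features")
```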
This step will summarize each cluster to help us understand the common characteristics of customers in each segment. The code for this section is:
# Load dplyr if not already loaded
library(dplyr)
# Group customers by cluster and calculate average values for each segment
customer_data %>%
group_by(Cluster) %>%
summarise(
Avg_TotalQuantity = mean(TotalQuantity), # Average quantity bought per cluster
Avg_TotalSpent = mean(TotalSpent), # Average spending per cluster
Avg_UnitPrice = mean(AvgUnitPrice), # Average price per item
Avg_NumTransactions = mean(NumTransactions), # Average number of purchases
Count = n() # Number of customers in each cluster
)
The output for the above code is:
| Cluster | Avg_TotalQuantity | Avg_TotalSpent | Avg_UnitPrice | Avg_NumTransactions | Count |
| --- | --- | --- | --- | --- | --- |
| <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| 1 | 5101.3312 | 8593.7480 | 3.788570 | 417.858 | 317 |
| 2 | 590.4525 | 984.5275 | 5.268309 | 58.868 | 4038 |
| 3 | 29.5000 | -1819.0650 | 6171.705000 | 3.000 | 2 |
| 4 | 60364.0000 | 106930.9267 | 4.310349 | 2443.533 | 15 |
This table gives you a clear breakdown of customer behavior in each cluster.
Key Observations of This Output:
- Cluster 2 is by far the largest segment (4,038 customers), with moderate spending and purchase frequency.
- Cluster 1 (317 customers) represents frequent, high-spending shoppers, averaging roughly 8,594 in total spend across about 418 transactions.
- Cluster 4 is a tiny group of 15 extremely high-volume customers, averaging over 106,000 in total spend; these are possibly wholesale buyers.
- Cluster 3 contains only 2 customers whose net spend is negative, most likely dominated by returns or cancelled orders.
This step creates a simple bar chart to compare the average total spending for each customer cluster. The code for this section is given in the code block below:
# Create a bar plot showing average total spending per customer cluster
ggplot(customer_data, aes(x = Cluster, y = TotalSpent, fill = Cluster)) +
stat_summary(fun = mean, geom = "bar") + # Plot mean values as bars
labs(title = "Average Total Spending per Cluster", # Add title and axis label
y = "Avg Total Spent") +
theme_minimal() # Use a clean, minimal theme
The output is a bar chart comparing the average total spending of the four clusters.
This step creates a bar chart that shows the average number of transactions made by customers in each cluster. The code for this section is given in the code block below:
# Bar plot showing average number of transactions per cluster
ggplot(customer_data, aes(x = Cluster, y = NumTransactions, fill = Cluster)) +
stat_summary(fun = mean, geom = "bar") + # Show mean as bars
labs(title = "Average Transactions per Cluster", # Add plot title
y = "Avg Transactions") +
theme_minimal() # Use a clean visual style
The output is a bar chart comparing the average number of transactions across the four clusters.
In this customer segmentation project using R, we performed customer segmentation with the K-means clustering algorithm. Using Google Colab and libraries such as dplyr, ggplot2, factoextra, and readr, we cleaned and transformed raw transaction data into meaningful customer insights.
We were able to identify four distinct customer segments based on total spending, purchase frequency, and quantity.
This customer segmentation project helped us learn key concepts in data preprocessing, clustering, and visualization. The results can be used for targeted marketing, improving customer experience, and making better business decisions based on data.
Colab Link:
https://colab.research.google.com/drive/14QGFDhHQmYYN-M3luTyi8uAbEFMqx368#scrollTo=4lFIYq1r3qp8