Customer Segmentation Project Using R: A Step-by-Step Guide
By Rohit Sharma
Updated on Jul 25, 2025 | 11 min read | 1.53K+ views
Businesses use customer segmentation to group customers based on similar behavior and traits. In this customer segmentation project using R, we'll perform customer segmentation using K-means clustering.
We'll work with R libraries such as dplyr, ggplot2, factoextra, and the built-in stats package to clean, analyze, and visualize customer patterns in the data.
This project will help you understand how to divide your customer base into relevant segments, which can further be used for targeted marketing, improved service, and profitable business decisions.
Before starting this project, you should be familiar with a few core concepts: basic R syntax, working with data frames, data cleaning and manipulation, and the idea behind K-means clustering.
To make sure your project runs smoothly, you'll need a set of tools and libraries to run the code. Those required for this customer segmentation project using R are listed in the table below:
| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Programming Language | R | Main language used for analysis and modeling |
| Platform | Google Colab / RStudio | Environment to write and run R code |
| Data Format | CSV | Input data file format |
| Library | dplyr | Data manipulation and cleaning |
| Library | ggplot2 | Data visualization |
| Library | stats (built-in) | Performing K-means clustering |
| Library | factoextra | Visualizing clustering results and the elbow plot |
| Library | readr | Reading CSV files |
| Optional Library | scales | Enhancing plot labels and formatting |
To begin, you need a dataset that contains customer-related information such as age, income, gender, and spending behavior. Websites like Kaggle offer free datasets you can download.
Setting Up Google Colab for R
Google Colab uses Python by default, so you need to switch the runtime to R: open Runtime > Change runtime type and select R as the runtime language.
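With the runtime switched, a quick check in the first cell confirms the notebook is actually running R (a minimal sketch; the exact version string will differ on your machine):

```r
# Print the version of the active R interpreter.
# If this errors or shows Python output, the runtime was not switched.
print(R.version.string)
```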
Before analyzing the data, we need to install the necessary R libraries. These libraries will help with functions like data manipulation, visualization, clustering, and file handling. The code for this step is given in the code block below:
# Installing essential libraries
install.packages("dplyr") # For data manipulation
install.packages("ggplot2") # For creating visualizations
install.packages("factoextra") # For visualizing clustering results
install.packages("readr") # For reading CSV files into R
In this step, we will load the libraries we installed and then read the dataset into R. This will help us start working with the customer data. The code for this section is given in the code block below:
# Load required libraries
library(dplyr) # For data manipulation
library(ggplot2) # For plotting graphs
library(factoextra) # For visualizing clusters
library(readr) # For reading CSV files
# Read the uploaded CSV file into a data frame
data <- read_csv("Customer Data.csv")
# Display the first few rows of the dataset
head(data)
Running this code displays the first few rows of the dataset:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| <chr> | <chr> | <chr> | <dbl> | <chr> | <dbl> | <dbl> | <chr> |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850 | United Kingdom |
This step will help us understand the structure of the dataset. We will get basic statistics for each column and check for any missing values. The code for this section is given in the code block below:
# Check the structure of the dataset: column names and data types
str(data)
# Get summary statistics like mean, min, max for each column
summary(data)
# Check for missing values in each column
colSums(is.na(data))
The above code returns the output:
| Column | Missing Values |
| --- | --- |
| InvoiceNo | 0 |
| StockCode | 0 |
| Description | 1454 |
| Quantity | 0 |
| InvoiceDate | 0 |
| UnitPrice | 0 |
| CustomerID | 135080 |
| Country | 0 |
This means that the Description column has 1,454 missing values and CustomerID has 135,080. Rows without a CustomerID cannot be tied to any customer, so we will drop them before aggregating.
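The missing-value check and removal can be rehearsed on a toy data frame; this is a base-R sketch with made-up values, not the project dataset:

```r
# Toy transactions: two of four rows lack a CustomerID
toy <- data.frame(
  InvoiceNo  = c("536365", "536366", "536367", "536368"),
  CustomerID = c(17850, NA, 17851, NA)
)

# Count missing values per column, as in the project code
colSums(is.na(toy))

# Drop rows with no CustomerID before any per-customer aggregation
toy_clean <- toy[!is.na(toy$CustomerID), ]
nrow(toy_clean)   # 2
```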
In this step, we will transform raw transaction data into aggregated customer-level data. The code for this section is given in the code block below:
library(dplyr)
# Create a new column for total price of each transaction
data <- data %>%
mutate(TotalPrice = Quantity * UnitPrice)
# Remove rows where CustomerID is missing
data <- data %>%
filter(!is.na(CustomerID))
# Group data by CustomerID and calculate key features:
# - Total quantity purchased
# - Total amount spent
# - Average unit price per item
# - Number of transactions made
customer_data <- data %>%
group_by(CustomerID) %>%
summarise(
TotalQuantity = sum(Quantity, na.rm = TRUE),
TotalSpent = sum(TotalPrice, na.rm = TRUE),
AvgUnitPrice = mean(UnitPrice, na.rm = TRUE),
NumTransactions = n()
)
# View the first few rows of the summarized customer data
head(customer_data)
The output of this section:
| CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions |
| --- | --- | --- | --- | --- |
| <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| 12346 | 0 | 0.00 | 1.040000 | 2 |
| 12347 | 2458 | 4310.00 | 2.644011 | 182 |
| 12348 | 2341 | 1797.24 | 5.764839 | 31 |
| 12349 | 631 | 1757.55 | 8.289041 | 73 |
| 12350 | 197 | 334.40 | 3.841176 | 17 |
| 12352 | 470 | 1545.41 | 23.274737 | 95 |
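The same group-and-summarise pattern can be reproduced in base R with aggregate(); the toy invoices below are made up for illustration and are not the project dataset:

```r
# Toy line items for two customers
tx <- data.frame(
  CustomerID = c(12347, 12347, 12348),
  Quantity   = c(2, 3, 10),
  UnitPrice  = c(1.5, 2.0, 0.5)
)
tx$TotalPrice <- tx$Quantity * tx$UnitPrice

# One row per customer: total quantity bought and total amount spent
per_customer <- aggregate(
  cbind(TotalQuantity = Quantity, TotalSpent = TotalPrice) ~ CustomerID,
  data = tx, FUN = sum
)
print(per_customer)
# Customer 12347: TotalQuantity 5, TotalSpent 9; customer 12348: 10 and 5
```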
Before applying clustering algorithms, it's important to remove non-numeric identifiers and normalize the features. The code for this step is given in the code block below.
# Remove CustomerID as it's just an identifier and not useful for clustering
customer_data_numeric <- customer_data %>%
select(-CustomerID)
# Normalize the numeric features so they have mean = 0 and standard deviation = 1
normalized_data <- scale(customer_data_numeric)
# View the first few rows of the normalized data
head(normalized_data)
| TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions |
| --- | --- | --- | --- |
| -0.2401871 | -0.23097457 | -0.047864577 | -0.391674900 |
| 0.2858369 | 0.29339811 | -0.036799633 | 0.382613202 |
| 0.2607983 | -0.01231481 | -0.015271237 | -0.266928483 |
| -0.1051500 | -0.01714367 | 0.002141461 | -0.086261260 |
| -0.1980281 | -0.19029006 | -0.028541230 | -0.327150891 |
| -0.1396048 | -0.04295351 | 0.105517240 | 0.008373953 |
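You can verify what scale() did: after scaling, every column has mean 0 and standard deviation 1. A base-R check on a toy matrix (not the customer data):

```r
# Two toy columns on very different scales
m <- matrix(c(10, 20, 30, 40,
              0.1, 0.2, 0.3, 0.4), ncol = 2)
z <- scale(m)

# Means are (numerically) zero and standard deviations are one
round(colMeans(z), 10)   # 0 0
apply(z, 2, sd)          # 1 1
```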
Before applying K-means, we need to choose the right number of clusters (k). The Elbow Method helps identify the optimal value by plotting the within-cluster sum of squares (WSS) for different values of k. The code for this section is given in the code block below:
# Install and load factoextra if not already done
install.packages("factoextra")
library(factoextra)
# Use the Elbow Method to visualize how WSS changes with different k values
fviz_nbclust(normalized_data, kmeans, method = "wss") +
labs(title = "Elbow Method for Choosing K")
The output is an elbow plot of WSS against k. The curve bends noticeably around k = 4, so we will use four clusters in the next step.
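Under the hood, the elbow curve is just the total within-cluster sum of squares reported by kmeans() at each k. Here is a hand-rolled version using only the built-in stats package on synthetic two-blob data (the numbers are illustrative, not from the project dataset):

```r
set.seed(42)
# Two well-separated synthetic clusters in 2D
blobs <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Total WSS for k = 1..10; the drop flattens after the true k (here 2)
wss <- sapply(1:10, function(k) {
  kmeans(blobs, centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```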
After determining the optimal number of clusters (k = 4), we apply the K-means algorithm to group customers into segments. This step will assign a cluster label to each customer based on their behavior. The code for this section is given in the code block below:
# Apply K-means clustering with 4 clusters
set.seed(123) # Ensures consistent results every time you run the code
kmeans_result <- kmeans(normalized_data, centers = 4, nstart = 25)
# Add the cluster label (1 to 4) back to the original customer data
customer_data$Cluster <- as.factor(kmeans_result$cluster)
# View the first few rows of the customer data with cluster assignments
head(customer_data)
The output of the above code is:
| CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions | Cluster |
| --- | --- | --- | --- | --- | --- |
| <dbl> | <dbl> | <dbl> | <dbl> | <int> | <fct> |
| 12346 | 0 | 0.00 | 1.040000 | 2 | 2 |
| 12347 | 2458 | 4310.00 | 2.644011 | 182 | 2 |
| 12348 | 2341 | 1797.24 | 5.764839 | 31 | 2 |
| 12349 | 631 | 1757.55 | 8.289041 | 73 | 2 |
| 12350 | 197 | 334.40 | 3.841176 | 17 | 2 |
| 12352 | 470 | 1545.41 | 23.274737 | 95 | 2 |
The output now includes a new column called Cluster. This column represents the customer segment (from 1 to 4) assigned by the K-means algorithm.
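A quick way to see how many customers fell into each segment is table() on the cluster column. In the project this would be table(customer_data$Cluster); the sketch below runs it on a small synthetic clustering instead:

```r
set.seed(123)
pts <- matrix(rnorm(60), ncol = 2)          # 30 synthetic "customers"
km  <- kmeans(pts, centers = 3, nstart = 25)

# Number of customers assigned to each of the 3 clusters
table(as.factor(km$cluster))
```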
In this step, we will generate a 2D scatter plot to visually explore how customers are grouped into different clusters. The code for this section is given in the code block below:
# Visualize the K-means clustering result
fviz_cluster(kmeans_result,
data = normalized_data,
geom = "point", # Show each customer as a point
ellipse.type = "norm", # Add normal-shaped boundary around clusters
palette = "jco", # Use a clean color palette
ggtheme = theme_minimal()) + # Minimalist theme for clarity
labs(title = "Customer Segmentation using K-Means") # Add plot title
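Note that fviz_cluster plots the clusters on the first two principal components, since our data has four features. If you would rather see the clusters in original units, a plain base-R scatter of any two features colored by cluster also works; the data below is synthetic for illustration:

```r
set.seed(123)
# Synthetic stand-ins for two of the project's features
toy <- cbind(TotalSpent      = c(rnorm(20, 100, 10), rnorm(20, 500, 10)),
             NumTransactions = c(rnorm(20, 5, 1),    rnorm(20, 50, 1)))
km <- kmeans(scale(toy), centers = 2, nstart = 25)

# In the project: replace `toy` with the matching customer_data columns
plot(toy, col = km$cluster, pch = 19,
     xlab = "Total Spent", ylab = "Number of Transactions",
     main = "Clusters on two original features")
```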
This step will summarize each cluster to help us understand the common characteristics of customers in each segment. The code for this section is:
# Load dplyr if not already loaded
library(dplyr)
# Group customers by cluster and calculate average values for each segment
customer_data %>%
group_by(Cluster) %>%
summarise(
Avg_TotalQuantity = mean(TotalQuantity), # Average quantity bought per cluster
Avg_TotalSpent = mean(TotalSpent), # Average spending per cluster
Avg_UnitPrice = mean(AvgUnitPrice), # Average price per item
Avg_NumTransactions = mean(NumTransactions), # Average number of purchases
Count = n() # Number of customers in each cluster
)
The output for the above code is:
| Cluster | Avg_TotalQuantity | Avg_TotalSpent | Avg_UnitPrice | Avg_NumTransactions | Count |
| --- | --- | --- | --- | --- | --- |
| <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| 1 | 5101.3312 | 8593.7480 | 3.788570 | 417.858 | 317 |
| 2 | 590.4525 | 984.5275 | 5.268309 | 58.868 | 4038 |
| 3 | 29.5000 | -1819.0650 | 6171.705000 | 3.000 | 2 |
| 4 | 60364.0000 | 106930.9267 | 4.310349 | 2443.533 | 15 |
This table gives you a clear breakdown of customer behavior in each cluster.
Key Observations of This Output:
- Cluster 2 is by far the largest segment (4,038 customers), with moderate spending and purchase frequency.
- Cluster 1 (317 customers) represents frequent, high-spending shoppers, averaging roughly 8,594 in total spend across about 418 transactions.
- Cluster 4 is a tiny group of 15 extremely high-volume customers, averaging over 106,000 in total spend; these are possibly wholesale buyers.
- Cluster 3 contains only 2 customers whose net spend is negative, most likely dominated by returns or cancelled orders.
This step creates a simple bar chart to compare the average total spending for each customer cluster. The code for this section is given in the code block below:
# Create a bar plot showing average total spending per customer cluster
ggplot(customer_data, aes(x = Cluster, y = TotalSpent, fill = Cluster)) +
stat_summary(fun = mean, geom = "bar") + # Plot mean values as bars
labs(title = "Average Total Spending per Cluster", # Add title and axis label
y = "Avg Total Spent") +
theme_minimal() # Use a clean, minimal theme
The output is a bar chart comparing the average total spending of the four clusters.
This step creates a bar chart that shows the average number of transactions made by customers in each cluster. The code for this section is given in the code block below:
# Bar plot showing average number of transactions per cluster
ggplot(customer_data, aes(x = Cluster, y = NumTransactions, fill = Cluster)) +
stat_summary(fun = mean, geom = "bar") + # Show mean as bars
labs(title = "Average Transactions per Cluster", # Add plot title
y = "Avg Transactions") +
theme_minimal() # Use a clean visual style
The output is a bar chart comparing the average number of transactions across the four clusters.
In this customer segmentation project using R, we performed customer segmentation with the K-means clustering algorithm. Using Google Colab and libraries such as dplyr, ggplot2, factoextra, and readr, we cleaned and transformed raw transaction data into meaningful customer insights.
We were able to identify four distinct customer segments based on total spending, purchase frequency, and quantity.
This customer segmentation project helped us learn key concepts in data preprocessing, clustering, and visualization. The results can be used for targeted marketing, improving customer experience, and making better business decisions based on data.
Colab Link:
https://colab.research.google.com/drive/14QGFDhHQmYYN-M3luTyi8uAbEFMqx368#scrollTo=4lFIYq1r3qp8