
Customer Segmentation Project Using R: A Step-by-Step Guide

By Rohit Sharma

Updated on Jul 25, 2025 | 11 min read | 1.53K+ views


Businesses use customer segmentation to group customers based on similar behavior and traits. In this customer segmentation project using R, we'll perform customer segmentation using K-means clustering. 

We'll work with various R libraries like dplyr, ggplot2, and stats to clean, analyze, and visualize customer patterns using the data we have. 

This project will help you understand how to divide your customer base into relevant segments, which can further be used for targeted marketing, improved service, and profitable business decisions.

Redefine your future with upGrad’s cutting-edge Data Science and AI programs. Dive into Generative AI, Machine Learning, and Advanced Analytics—100% online, built for tomorrow’s leaders. Get started now.

Want More R Projects? Click Here: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

Things You Should Know Before Starting the Project

There are various concepts you need to be familiar with before starting this project. Some of the most important ones are listed below.

  • Basic knowledge of R programming, including variables, data frames, and functions
  • You need to be familiar with data manipulation using dplyr
  • Basic understanding of data visualization with ggplot2 is helpful
  • Knowing the concept of clustering, especially the K-means algorithm, helps a lot
  • Some basic ideas of scaling and normalizing data
  • You should know how to install and use R packages in Google Colab or RStudio
  • Basic understanding of CSV files and data formats is necessary

Step into the future of data with upGrad’s next-gen programs in Generative AI and Data Science. Build real-world expertise, earn industry-recognized credentials, and stay ahead in the AI-powered era. Your journey begins now.

Project Duration, Difficulty, and Skill Level Required For This Project

  • Estimated Duration: 2 to 4 hours. This mainly depends on your familiarity with R and clustering concepts.
  • Difficulty Level: Easy to Moderate. This project is ideal for beginners with basic R knowledge.
  • Skill Level Required:
    • Basic R programming
    • Introductory understanding of data preprocessing
    • Beginner-level knowledge of clustering (K-means)
    • No prior experience with machine learning required

What Are the Tools and R Libraries Required for This Customer Segmentation Project?

To make sure your project runs smoothly, you’ll need a small set of tools and libraries to run the code. The tools and libraries required for this customer segmentation project using R are listed in the table below:

Category | Tool / Library | Purpose
Programming Language | R | Main language used for analysis and modeling
Platform | Google Colab / RStudio | Environment to write and run R code
Data Format | CSV | Input data file format
Library | dplyr | Data manipulation and cleaning
Library | ggplot2 | Data visualization
Library | stats (built-in) | Performing K-means clustering
Library | readr | Reading CSV files
Optional Library | scales | Enhancing plot labels and formatting

Kickstart your analytics journey with our free Introduction to Data Analysis using Excel course. Learn to clean, analyze, and visualize data like a pro

Detailed Explanation of This Customer Segmentation Project Using R

Step 1: Downloading the Dataset and Setting Up Google Colab for R

To begin, you need a dataset that contains customer-related information such as age, income, gender, and spending behavior. Websites like Kaggle offer free datasets you can download. 

Setting Up Google Colab for R

Google Colab uses Python by default, so you need to switch the runtime to R:

  1. Go to Google Colab
  2. Click on Runtime > Change runtime type
  3. Change the Runtime type to R
  4. Click Save

Step 2: Installing Required Libraries in R

Before analyzing the data, we need to install the necessary R libraries. These libraries will help with functions like data manipulation, visualization, clustering, and file handling. The code for this step is given in the code block below:

# Installing essential libraries
install.packages("dplyr")        # For data manipulation
install.packages("ggplot2")      # For creating visualizations
install.packages("factoextra")   # For visualizing clustering results
install.packages("readr")        # For reading CSV files into R

Step 3: Loading Libraries and Reading the Dataset

In this step, we will load the libraries we installed and then read the dataset into R. This will help us start working with the customer data. The code for this section is given in the code block below:

# Load required libraries
library(dplyr)        # For data manipulation
library(ggplot2)      # For plotting graphs
library(factoextra)   # For visualizing clusters
library(readr)        # For reading CSV files

# Read the uploaded CSV file into a data frame
data <- read_csv("Customer Data.csv")

# Display the first few rows of the dataset
head(data)

Running this code returns the first few rows of the dataset:

InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country
<chr> | <chr> | <chr> | <dbl> | <chr> | <dbl> | <dbl> | <chr>
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom
536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom
536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom
536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom
536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850 | United Kingdom
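
If read_csv guesses a column type you don't want (for example, you may prefer to keep InvoiceNo or CustomerID as text), you can spell the types out explicitly. The sketch below is optional and assumes the same file name used above.

# Optional: read the CSV with explicit column types instead of letting readr guess
data <- read_csv(
  "Customer Data.csv",
  col_types = cols(
    InvoiceNo   = col_character(),
    StockCode   = col_character(),
    Description = col_character(),
    Quantity    = col_double(),
    InvoiceDate = col_character(),
    UnitPrice   = col_double(),
    CustomerID  = col_double(),
    Country     = col_character()
  )
)

# Confirm how many rows and columns were read
dim(data)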

Uncover hidden patterns with ease! Master K-Means and more in this free Unsupervised Learning course. Enrol now and start clustering like a pro.

Step 4: Exploring the Dataset Structure and Quality

This step will help us understand the structure of the dataset. We will get basic statistics for each column and check for any missing values. The code for this section is given in the code block below:

# Check the structure of the dataset: column names and data types
str(data)

# Get summary statistics like mean, min, max for each column
summary(data)

# Check for missing values in each column
colSums(is.na(data))

The missing-value check (the last line of the code above) returns:

InvoiceNo 0 
StockCode 0
Description 1454 
Quantity 0 
InvoiceDate 0 
UnitPrice 0 
CustomerID 135080 
Country 0

This means that:

  • Description has 1454 missing values
  • CustomerID has a very large number of missing values: 135,080
  • All other columns have 0 missing values
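
To see how much of the data those counts represent, you can also express them as percentages of all rows. This is a small optional check on the same data frame:

# Optional: missing values as a percentage of all rows
round(colSums(is.na(data)) / nrow(data) * 100, 2)

# Optional: number of rows that would remain after dropping missing CustomerID values
sum(!is.na(data$CustomerID))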

Step 5: Creating Customer-Level Features for Segmentation

In this step, we will transform raw transaction data into aggregated customer-level data. The code for this section is given in the code block below:

library(dplyr)

# Create a new column for total price of each transaction
data <- data %>%
  mutate(TotalPrice = Quantity * UnitPrice)

# Remove rows where CustomerID is missing
data <- data %>%
  filter(!is.na(CustomerID))

# Group data by CustomerID and calculate key features:
# - Total quantity purchased
# - Total amount spent
# - Average unit price per item
# - Number of transactions made
customer_data <- data %>%
  group_by(CustomerID) %>%
  summarise(
    TotalQuantity = sum(Quantity, na.rm = TRUE),
    TotalSpent = sum(TotalPrice, na.rm = TRUE),
    AvgUnitPrice = mean(UnitPrice, na.rm = TRUE),
    NumTransactions = n()
  )

# View the first few rows of the summarized customer data
head(customer_data)

The output of this section is:

CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions
<dbl> | <dbl> | <dbl> | <dbl> | <int>
12346 | 0 | 0.00 | 1.040000 | 2
12347 | 2458 | 4310.00 | 2.644011 | 182
12348 | 2341 | 1797.24 | 5.764839 | 31
12349 | 631 | 1757.55 | 8.289041 | 73
12350 | 197 | 334.40 | 3.841176 | 17
12352 | 470 | 1545.41 | 23.274737 | 95
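
As you will see in the cluster summary later, some customers end up with a negative TotalSpent, which suggests the raw data includes returns or cancellations recorded with negative quantities. If you would rather segment on purchases only, a variant like the sketch below filters those rows out before aggregating. The customer_data_purchases name is just an illustration; using it in place of customer_data will change the downstream results.

# Optional variant: keep only positive quantities and prices so that
# returns/cancellations do not produce negative totals per customer
customer_data_purchases <- data %>%
  filter(Quantity > 0, UnitPrice > 0) %>%
  group_by(CustomerID) %>%
  summarise(
    TotalQuantity   = sum(Quantity),
    TotalSpent      = sum(TotalPrice),
    AvgUnitPrice    = mean(UnitPrice),
    NumTransactions = n()
  )

head(customer_data_purchases)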

Click to Learn More About: R For Data Science: Why Should You Choose R for Data Science?

Step 6: Preparing the Data for Clustering

Before applying clustering algorithms, it's important to remove non-numeric identifiers and normalize the features. The code for this step is given in the code block below.

# Remove CustomerID as it's just an identifier and not useful for clustering
customer_data_numeric <- customer_data %>%
  select(-CustomerID)

# Normalize the numeric features so they have mean = 0 and standard deviation = 1
normalized_data <- scale(customer_data_numeric)

# View the first few rows of the normalized data
head(normalized_data)

The first few rows of the normalized data look like this:

TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions
-0.2401871 | -0.23097457 | -0.047864577 | -0.391674900
0.2858369 | 0.29339811 | -0.036799633 | 0.382613202
0.2607983 | -0.01231481 | -0.015271237 | -0.266928483
-0.1051500 | -0.01714367 | 0.002141461 | -0.086261260
-0.1980281 | -0.19029006 | -0.028541230 | -0.327150891
-0.1396048 | -0.04295351 | 0.105517240 | 0.008373953
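
As a quick sanity check, every scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1. A minimal optional check:

# Optional check: column means should be ~0 and standard deviations ~1 after scaling
round(colMeans(normalized_data), 3)
round(apply(normalized_data, 2, sd), 3)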

 

Step 7: Finding the Optimal Number of Clusters (K)

For K-means to cluster the data effectively, we need to choose the right number of clusters (k). The Elbow Method helps identify the optimal value by plotting the within-cluster sum of squares (WSS) for different values of k. The code for this section is given in the code block below:

# Install and load factoextra if not already done
install.packages("factoextra")
library(factoextra)

# Use the Elbow Method to visualize how WSS changes with different k values
fviz_nbclust(normalized_data, kmeans, method = "wss") +
  labs(title = "Elbow Method for Choosing K")

The output is an elbow plot of WSS against the number of clusters; the point where the curve bends (the "elbow") suggests the optimal k, which in this case is 4.
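
If the elbow is hard to read, the average silhouette width is a common cross-check, and fviz_nbclust supports it with a different method argument. This is an optional alternative rather than part of the original walkthrough:

# Optional cross-check: average silhouette width for different values of k.
# The k with the highest average silhouette width is a reasonable candidate.
fviz_nbclust(normalized_data, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Choosing K")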

Must Read: Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices

Step 8: Applying K-Means Clustering to Segment Customers

After determining the optimal number of clusters (k = 4), we apply the K-means algorithm to group customers into segments. This step will assign a cluster label to each customer based on their behavior. The code for this section is given in the code block below:

# Apply K-means clustering with 4 clusters
set.seed(123)  # Ensures consistent results every time you run the code
kmeans_result <- kmeans(normalized_data, centers = 4, nstart = 25)

# Add the cluster label (1 to 4) back to the original customer data
customer_data$Cluster <- as.factor(kmeans_result$cluster)

# View the first few rows of the customer data with cluster assignments
head(customer_data)

The output of the above code is:

CustomerID | TotalQuantity | TotalSpent | AvgUnitPrice | NumTransactions | Cluster
<dbl> | <dbl> | <dbl> | <dbl> | <int> | <fct>
12346 | 0 | 0.00 | 1.040000 | 2 | 2
12347 | 2458 | 4310.00 | 2.644011 | 182 | 2
12348 | 2341 | 1797.24 | 5.764839 | 31 | 2
12349 | 631 | 1757.55 | 8.289041 | 73 | 2
12350 | 197 | 334.40 | 3.841176 | 17 | 2
12352 | 470 | 1545.41 | 23.274737 | 95 | 2

 

The output now includes a new column called Cluster. This column represents the customer segment (from 1 to 4) assigned by the K-means algorithm.
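
Before visualizing the segments, it can help to glance at how big each cluster is and where its centre lies in the scaled feature space. A short optional look at the object returned by kmeans():

# Optional: number of customers assigned to each cluster
table(kmeans_result$cluster)

# Optional: cluster centres (in scaled units, one row per cluster)
kmeans_result$centers

# Optional: share of total variance explained by the clustering
kmeans_result$betweenss / kmeans_result$totss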

Step 9: Visualizing Customer Segments

In this step, we will generate a 2D scatter plot to visually explore how customers are grouped into different clusters. The code for this section is given in the code block below:

# Visualize the K-means clustering result
fviz_cluster(kmeans_result,
             data = normalized_data,
             geom = "point",          # Show each customer as a point
             ellipse.type = "norm",   # Add normal-shaped boundary around clusters
             palette = "jco",         # Use a clean color palette
             ggtheme = theme_minimal()) +  # Minimalist theme for clarity
  labs(title = "Customer Segmentation using K-Means")  # Add plot title

Also Read: Data Science Project Ideas for Beginners | Python IDEs for Data Science and Machine Learning

Step 10: Profiling Customer Segments with Cluster Summary

This step will summarize each cluster to help us understand the common characteristics of customers in each segment. The code for this section is:

# Load dplyr if not already loaded
library(dplyr)

# Group customers by cluster and calculate average values for each segment
customer_data %>%
  group_by(Cluster) %>%
  summarise(
    Avg_TotalQuantity = mean(TotalQuantity),       # Average quantity bought per cluster
    Avg_TotalSpent = mean(TotalSpent),             # Average spending per cluster
    Avg_UnitPrice = mean(AvgUnitPrice),            # Average price per item
    Avg_NumTransactions = mean(NumTransactions),   # Average number of purchases
    Count = n()                                     # Number of customers in each cluster
  )

The output for the above code is:

Cluster | Avg_TotalQuantity | Avg_TotalSpent | Avg_UnitPrice | Avg_NumTransactions | Count
<fct> | <dbl> | <dbl> | <dbl> | <dbl> | <int>
1 | 5101.3312 | 8593.7480 | 3.788570 | 417.858 | 317
2 | 590.4525 | 984.5275 | 5.268309 | 58.868 | 4038
3 | 29.5000 | -1819.0650 | 6171.705000 | 3.000 | 2
4 | 60364.0000 | 106930.9267 | 4.310349 | 2443.533 | 15

 

This table gives you a clear breakdown of customer behavior in each cluster. 

Key Observations of This Output:

  • Clusters 1 and 4 are likely your most valuable customers, especially Cluster 4, though it contains very few people.
  • Cluster 2 includes the bulk of customers and reflects average spending.
  • Cluster 3 seems to contain outliers or problematic data (very high price, negative spend, very low count).
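
If you want to carry these observations forward, you can attach descriptive names to the cluster numbers. The labels and the SegmentName column below are purely illustrative assumptions based on the observations above, and the cluster numbering can differ between runs, so check your own summary before reusing them.

# Illustrative only: map cluster numbers to descriptive segment names.
# Both the labels and the SegmentName column are assumptions; verify the
# numbers against your own cluster summary first.
customer_data <- customer_data %>%
  mutate(SegmentName = case_when(
    Cluster == "4" ~ "Very high value (few customers)",
    Cluster == "1" ~ "High value",
    Cluster == "2" ~ "Regular shoppers",
    Cluster == "3" ~ "Outliers / data issues",
    TRUE           ~ "Unassigned"
  ))

# Number of customers in each named segment
count(customer_data, SegmentName)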

Step 11: Visualizing Average Spending Across Customer Segments

This step creates a simple bar chart to compare the average total spending for each customer cluster. The code for this section is given in the code block below:

# Create a bar plot showing average total spending per customer cluster
ggplot(customer_data, aes(x = Cluster, y = TotalSpent, fill = Cluster)) +
  stat_summary(fun = mean, geom = "bar") +                 # Plot mean values as bars
  labs(title = "Average Total Spending per Cluster",       # Add title and axis label
       y = "Avg Total Spent") +
  theme_minimal()                                          # Use a clean, minimal theme

The output is a bar chart comparing the average total spending of each cluster.
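
The tools table earlier lists scales as an optional library for enhancing plot labels. If the spending values on the y-axis appear in scientific notation, a small tweak like the one below (assuming scales is installed) formats them with comma separators:

# Optional: same bar chart with comma-formatted y-axis labels via the scales package
ggplot(customer_data, aes(x = Cluster, y = TotalSpent, fill = Cluster)) +
  stat_summary(fun = mean, geom = "bar") +
  scale_y_continuous(labels = scales::comma) +   # comma formatting from scales
  labs(title = "Average Total Spending per Cluster",
       y = "Avg Total Spent") +
  theme_minimal()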

Here’s Something You Should Know: What is Data Wrangling? Exploring Its Role in Data Analysis

Step 12: Visualizing Average Number of Transactions per Cluster

This step creates a bar chart that shows the average number of transactions made by customers in each cluster. The code for this section is given in the code block below:

# Bar plot showing average number of transactions per cluster
ggplot(customer_data, aes(x = Cluster, y = NumTransactions, fill = Cluster)) +
  stat_summary(fun = mean, geom = "bar") +                      # Show mean as bars
  labs(title = "Average Transactions per Cluster",              # Add plot title
       y = "Avg Transactions") +
  theme_minimal()                                               # Use a clean visual style

The output is a bar chart comparing the average number of transactions made by customers in each cluster.

Conclusion

In this customer segmentation project using R, we successfully performed customer segmentation using the K-means clustering algorithm. By using tools like Google Colab and libraries such as dplyr, ggplot2, factoextra, and readr, we cleaned and transformed raw transaction data into meaningful customer insights. 

We were able to identify four distinct customer segments based on total spending, purchase frequency, and quantity.

This customer segmentation project helped us learn key concepts in data preprocessing, clustering, and visualization. The results can be used for targeted marketing, improving customer experience, and making better business decisions based on data.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/14QGFDhHQmYYN-M3luTyi8uAbEFMqx368#scrollTo=4lFIYq1r3qp8

Frequently Asked Questions (FAQs)

1. What is customer segmentation, and why is it important for businesses?

2. Which R libraries and tools are commonly used for customer segmentation projects?

3. Can I use other algorithms besides K-Means for customer segmentation?

4. What skills will I develop by completing this customer segmentation project in R?

5. What are some alternative data science projects I can try after this?

