Mall Customer Segmentation Project Using R

By Rohit Sharma

Updated on Aug 07, 2025 | 11 min read | 1.37K+ views

Mall customer segmentation in R is a way to understand consumer behavior by grouping customers based on similarities in age, gender, income, and spending habits. 

In this project, we will use R and the K-means clustering algorithm to analyze mall customer data and identify distinct customer segments. This method is used by businesses to target their marketing efforts effectively and improve customer satisfaction. 

From loading and visualizing the data to scaling and clustering, this step-by-step guide explains how to perform customer segmentation using R in Google Colab.


Read this to improve your R project-building skills: Top 25+ R Projects for Beginners.

Project Duration, Difficulty, and Skill Level Required

This project requires certain skills and a modest time commitment. The actual duration depends on your skill level; both are summarized in the table below.

Aspect              Details
Estimated Duration  60–90 minutes
Difficulty Level    Beginner-friendly
Required Skills     Basic knowledge of R syntax, working with data frames, simple plotting with ggplot2, basic understanding of clustering concepts


Essential Tools and R Libraries for Mall Customer Segmentation Project

To complete this project on mall customer segmentation in R, you'll need a few essential tools and libraries. These tools will help with data manipulation, visualization, clustering, and overall project execution within the Google Colab environment.

Tool/Library   Purpose/Usage
Google Colab   Cloud-based environment to run R code
R              Programming language used for analysis
ggplot2        For creating data visualizations
dplyr          For efficient data manipulation and cleaning
cluster        Provides clustering algorithms and utilities (e.g., PAM, silhouette analysis; K-means itself ships with base R's stats package)
factoextra     For visualizing the results of clustering analysis

Complete Step-by-Step Guide to the Mall Customer Segmentation Project in R

In this section, we will break down the steps involved in building this project, along with the code and explanation.

Step 1: Configure Google Colab for R Programming

To begin this project, you must set up your Google Colab environment to run R instead of Python. This ensures compatibility with R syntax and libraries throughout the analysis.

Steps to switch the runtime to R:

  1. Open a new notebook in Google Colab.
  2. In the top menu, go to Runtime.
  3. Select Change runtime type.
  4. In the Runtime type dropdown, choose R.
  5. Click Save to apply the change.
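After switching, you can confirm the notebook is actually running R with a quick check in the first cell (a sanity check, not part of the original steps):

```r
# Confirm the active runtime is R, not Python
print(R.version.string)   # prints something like "R version 4.x.x ..."
```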

Step 2: Install and Load Required Libraries

This step ensures all necessary R packages are available and loaded into your environment. These libraries will help you handle data, visualize insights, and perform clustering for customer segmentation. Here is the code for this step:

# Install required libraries (only once per session)
install.packages("ggplot2")     # Visualization package
install.packages("dplyr")       # For data wrangling and manipulation
install.packages("cluster")     # For clustering algorithms and utilities (PAM, silhouette, etc.)
install.packages("factoextra")  # For better visualization of clustering results

# Load the libraries into the session
library(ggplot2)      # Load visualization tools
library(dplyr)        # Load data manipulation functions
library(cluster)      # Load clustering algorithms
library(factoextra)   # Load cluster visualization tools

Running the above code prints confirmation messages as the required packages and their dependencies are installed and loaded:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘corrplot’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’

 

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

   filter, lag


The following objects are masked from ‘package:base’:

   intersect, setdiff, setequal, union


Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Also Read: How to Build an Uber Data Analysis Project in R | Trend Analysis Project on COVID-19 Using R

Step 3: Load and Preview the Dataset

In this step, you will load the uploaded dataset into your R environment and preview the first few rows to get an initial understanding of the data. Here’s the code:

# Load your uploaded dataset
mall_data <- read.csv("Mall_Customers.csv")  # Read the CSV file into a data frame

# Preview the first few rows
head(mall_data)  # Display the top 6 rows to inspect the data structure

The above step gives us a glimpse of the dataset we’re using.

 

  CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
1          1   Male  19                 15                     39
2          2   Male  21                 15                     81
3          3 Female  20                 16                      6
4          4 Female  23                 16                     77
5          5 Female  31                 17                     40
6          6 Female  22                 17                     76

 

Step 4: Understand the Structure and Summary of the Data

This section helps you explore the dataset's structure and basic statistics. It shows the type of each column and summary values like minimum, mean, and maximum. Here’s the code for this step:

# See the structure of the dataset
str(mall_data)     # View data types and structure of each column

# Get basic summary statistics
summary(mall_data) # Get min, max, mean, and quartiles for numeric columns

The output for the above code is:

'data.frame': 200 obs. of  5 variables:

 $ CustomerID            : int  1 2 3 4 5 6 7 8 9 10 ...

 $ Gender                : chr  "Male" "Male" "Female" "Female" ...

 $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...

 $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...

 $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...

 

  CustomerID        Gender               Age        Annual.Income..k..

 Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00    

 1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50    

 Median :100.50   Mode  :character   Median :36.00   Median : 61.50    

 Mean   :100.50                      Mean   :38.85   Mean   : 60.56    

 3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00    

 Max.   :200.00                      Max.   :70.00   Max.   :137.00    

 Spending.Score..1.100.

 Min.   : 1.00         

 1st Qu.:34.75         

 Median :50.00         

 Mean   :50.20         

 3rd Qu.:73.00         

 Max.   :99.00       

Here are a few R Projects: Wine Quality Prediction Project in R | Loan Approval Classification Using Logistic Regression in R

Step 5: Clean the Data by Removing Unnecessary Columns and Checking for Missing Values

This step removes irrelevant columns like CustomerID, which don't contribute to clustering. It also checks for any missing values in the dataset. The code for this step is:

# Remove 'CustomerID' as it's just an identifier
mall_data <- mall_data %>% select(-CustomerID)  # Drop the CustomerID column

# Check for any missing values
colSums(is.na(mall_data))  # Sum of NA values in each column

The output for the above code is:

                Gender                    Age     Annual.Income..k.. Spending.Score..1.100.
                     0                      0                      0                      0
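This dataset is complete, but if NAs did appear, a minimal base-R sketch for handling them could look like this (the small demo data frame below is made up for illustration):

```r
# Hypothetical data frame with missing values (the mall dataset has none)
demo <- data.frame(Age = c(19, NA, 20), Spending = c(39, 81, NA))

# Option 1: drop any row containing an NA
complete_rows <- na.omit(demo)        # keeps only the first row here

# Option 2: impute numeric NAs with the column mean
demo$Age[is.na(demo$Age)] <- mean(demo$Age, na.rm = TRUE)   # NA becomes 19.5
```

Mean imputation keeps all rows but flattens variance slightly, so for a clustering task dropping incomplete rows is often the safer default.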

Step 6: Explore the Data Through Visualizations

This step uses basic plots to explore customer demographics and spending behavior. It helps you identify patterns and relationships in the data before applying clustering. Here’s the code to generate the graphs.

# Gender distribution
ggplot(mall_data, aes(x = Gender)) + 
  geom_bar(fill = "lightblue") +  # Bar chart for gender count
  ggtitle("Gender Distribution")   # Add a title to the chart

# Age distribution
ggplot(mall_data, aes(x = Age)) +
  geom_histogram(fill = "orange", bins = 20) +  # Histogram to show age spread
  ggtitle("Age Distribution")                   # Add a title

# Income vs Spending Score
ggplot(mall_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
  geom_point(color = "darkgreen") +             # Scatter plot of income vs spending
  ggtitle("Income vs Spending Score")           # Add a title

The above code produces three graphs, based on gender distribution, age distribution, and income vs spending score.


Step 7: Convert Categorical Data into Numeric Format

Clustering algorithms require numeric input. This step converts the Gender column from text to numeric format to make it suitable for clustering. The code is:

# Convert Gender to numeric: Male = 1, Female = 0
mall_data$Gender <- ifelse(mall_data$Gender == "Male", 1, 0)  # Replace "Male" with 1 and "Female" with 0

# Check again
str(mall_data)  # Confirm the data type of the Gender column is now numeric

The output for this section is:

'data.frame': 200 obs. of  4 variables:

 $ Gender                : num  1 1 0 0 0 0 0 0 1 0 ...

 $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...

 $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...

 $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...
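ifelse() is enough for a two-level column like Gender; for a categorical column with more levels, encoding through factor() generalizes. A small sketch with a made-up vector:

```r
# Numeric codes from factor levels (levels sort alphabetically)
tiers <- c("Bronze", "Gold", "Silver", "Gold")   # hypothetical category column
codes <- as.numeric(factor(tiers))               # Bronze = 1, Gold = 2, Silver = 3
codes                                            # 1 2 3 2
```

Note that K-means treats these codes as ordinary distances, so for a category with more than two levels, one-hot (dummy) encoding is usually preferable to a single integer column.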

Step 8: Normalize the Data for Clustering

Clustering is sensitive to differences in scale. This step standardizes the dataset so that each feature contributes equally to the distance calculations. The code to normalize the data is:

# Normalize the dataset
mall_scaled <- scale(mall_data)  # Standardize all numeric features

# Check scaled data
head(mall_scaled)  # Preview the first few rows of the normalized dataset

The output for this step is given below:

         Gender        Age Annual.Income..k.. Spending.Score..1.100.
[1,]  1.1253282 -1.4210029          -1.734646             -0.4337131
[2,]  1.1253282 -1.2778288          -1.734646              1.1927111
[3,] -0.8841865 -1.3494159          -1.696572             -1.7116178
[4,] -0.8841865 -1.1346547          -1.696572              1.0378135
[5,] -0.8841865 -0.5619583          -1.658498             -0.3949887
[6,] -0.8841865 -1.2062418          -1.658498              0.9990891

 
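scale() centers each column on its mean and divides by its standard deviation; the z-score arithmetic can be verified by hand on a short vector (a base-R sanity check using the first few income values from the dataset):

```r
x <- c(15, 15, 16, 16, 17, 17)        # first few Annual.Income..k.. values
z <- as.vector(scale(x))              # z-scores from scale()
manual <- (x - mean(x)) / sd(x)       # the same formula written out

all.equal(z, manual)                  # TRUE: scale() is just (x - mean) / sd
# A scaled column always ends up with mean 0 and standard deviation 1
```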

Here are some R projects: Car Data Analysis Project Using R | Player Performance Analysis & Prediction Using R

Step 9: Determine the Optimal Number of Clusters Using the Elbow Method

Before applying K-means clustering, it’s important to decide how many clusters (k) to create. The Elbow Method helps identify the ideal number by plotting the within-cluster sum of squares (WSS) for different values of k. Here’s the code:

# Elbow method to choose best k
fviz_nbclust(mall_scaled, kmeans, method = "wss") +     # Plot WSS for each k
  geom_vline(xintercept = 5, linetype = 2) +             # Add a vertical line at the optimal k (e.g., 5)
  ggtitle("Elbow Method to Find Optimal k")              # Add a plot title

The above code produces an elbow graph, which shows that:

  • The X-axis shows the number of clusters (k), and the Y-axis shows the total within-cluster variation (how tightly grouped the data points are in each cluster).
  • As k increases, the variation (error) decreases, meaning the clusters are fitting better.
  • The “elbow point” is at k = 5, where the drop in error slows down; this is usually considered the optimal number of clusters.
  • After k = 5, adding more clusters gives diminishing returns, so 5 clusters balance accuracy and simplicity.
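Under the hood, fviz_nbclust() with method = "wss" fits kmeans() for each k and records tot.withinss; the same elbow curve can be built with base R alone. A sketch on synthetic data (in the actual project you would pass mall_scaled instead of demo_scaled):

```r
set.seed(123)
demo_scaled <- scale(matrix(rnorm(200 * 2), ncol = 2))   # stand-in for mall_scaled

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(demo_scaled, centers = k, nstart = 25)$tot.withinss
})

# Base-R elbow plot: look for the bend where improvement slows
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow Method (base R)")
```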

Step 10: Apply K-Means Clustering to Segment Customers

Once the optimal number of clusters is known, K-means is applied to group similar customers. This step creates cluster labels and attaches them to the original dataset. Here’s the code:

# Run k-means with 5 clusters
set.seed(123)  # Set seed for reproducibility
kmeans_model <- kmeans(mall_scaled, centers = 5, nstart = 25)  # Apply K-means with 5 clusters

# Add cluster assignments to original data
mall_data$Cluster <- as.factor(kmeans_model$cluster)  # Store cluster number as a new column

# View with cluster info
head(mall_data)  # See the first few rows including the new Cluster column

The output for the above code gives us a table:

 

  Gender Age Annual.Income..k.. Spending.Score..1.100. Cluster
1      1  19                 15                     39       2
2      1  21                 15                     81       2
3      0  20                 16                      6       5
4      0  23                 16                     77       1
5      0  31                 17                     40       5
6      0  22                 17                     76       1

 
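Beyond previewing rows, the fitted kmeans object itself reports cluster sizes, centroids, and fit quality. The fields below are standard kmeans() output (shown on synthetic data so the snippet runs standalone; on the project data you would inspect kmeans_model directly):

```r
set.seed(123)
demo <- scale(matrix(rnorm(100 * 4), ncol = 4))     # stand-in for mall_scaled
model <- kmeans(demo, centers = 5, nstart = 25)     # same call shape as Step 10

model$size          # number of points assigned to each of the 5 clusters
model$centers       # centroid of each cluster in scaled feature space
model$tot.withinss  # total within-cluster SS (lower = tighter clusters)
```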

Step 11: Visualize the Customer Clusters

This step generates a cluster plot to visually understand how the customers are grouped. Each point represents a customer, and colors indicate their assigned cluster. The code is:

# Plot clusters
fviz_cluster(kmeans_model, data = mall_scaled, 
             ellipse.type = "convex",       # Draw convex shapes around clusters
             palette = "jco",               # Use a predefined color palette
             ggtheme = theme_minimal())     # Apply a clean minimal theme

The above code creates a cluster plot:

The above cluster graph shows that:

  • Each point is a customer, plotted based on their features like Age, Income, and Spending, reduced to 2 dimensions (Dim1 & Dim2).
  • Different shapes and colors show 5 customer segments, grouped using clustering (like K-Means).
  • Clusters are well-separated, meaning customers within each group behave similarly and are different from other groups.
  • Cluster boundaries form distinct regions, helping visually understand how the customers are grouped.
  • Customer IDs are labeled, allowing you to track individual data points within each segment.
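Dim1 and Dim2 in the plot are the first two principal components of the scaled features, which is how fviz_cluster() flattens four dimensions into two. The projection itself is ordinary PCA, sketched here with base R's prcomp() on synthetic data standing in for mall_scaled:

```r
set.seed(123)
demo_scaled <- scale(matrix(rnorm(50 * 4), ncol = 4))  # stand-in for mall_scaled

pca <- prcomp(demo_scaled)     # principal component analysis
coords <- pca$x[, 1:2]         # Dim1 and Dim2: the 2-D coordinates being plotted
summary(pca)                   # proportion of variance each component explains
```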

Must Read: 18 Types of Regression in Machine Learning You Should Know

Step 12: Analyze the Characteristics of Each Cluster

After forming clusters, this step helps summarize the average attributes of customers within each group. It gives insights into how different customer segments behave based on features like age, income, and spending. The code is:

# Summarize each cluster
mall_data %>%
  group_by(Cluster) %>%         # Group data by cluster
  summarise_all(mean)           # Calculate mean of each feature within each cluster

The output for this step is:

Cluster    Gender      Age Annual.Income..k.. Spending.Score..1.100.
1       0.0000000 28.39286           60.42857               68.17857
2       1.0000000 28.25000           62.00000               71.67500
3       1.0000000 55.90323           48.77419               38.80645
4       0.5483871 40.41935           90.00000               15.74194
5       0.0000000 49.14286           46.33333               39.61905

 

The above table shows that:

  • Clusters 1 & 2: Young customers (one all-female, one all-male) with moderate income and high spending – ideal for targeted marketing.
  • Cluster 3: Older males with lower income and below-average spending – less likely to spend big, but still active.
  • Cluster 4: Middle-aged, mixed-gender, high income but very low spending – untapped potential worth exploring.
  • Cluster 5: Older females with low income and moderate spending – consistent but limited spenders.
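To put these segments to work, a typical final step counts customers per cluster and writes the labeled data out for the marketing team. A base-R sketch (demo stands in for mall_data with its Cluster column from Step 10; the filename is arbitrary):

```r
demo <- data.frame(Age = c(19, 21, 20, 23),
                   Cluster = factor(c(2, 2, 5, 1)))    # stand-in for mall_data

table(demo$Cluster)    # how many customers fall in each segment

# Save the labeled data for downstream use
write.csv(demo, "segmented_customers.csv", row.names = FALSE)
```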

Conclusion

In this Mall Customer Segmentation project, we used the K-means clustering algorithm in R on Google Colab to segment customers based on gender, age, annual income, and spending score.

After preprocessing the dataset, we visualized patterns using scatter plots and applied the Elbow Method to determine the optimal number of clusters, which was found to be 5. The K-means model successfully grouped customers into distinct segments. These segments can help businesses tailor marketing strategies, personalize offers, and improve customer targeting based on spending behavior and income levels.


Colab Link:
https://colab.research.google.com/drive/1IpC3qGhGDMixBdNNLtZpQbuH0HCOQNsW#scrollTo=V7CT2Q_9Zkc1

Frequently Asked Questions (FAQs)

1. How does customer segmentation benefit retail businesses?

2. Can I perform this project without any prior machine learning experience?

3. What kind of dataset is suitable for customer segmentation?

4. Which visualizations are helpful in understanding clusters?

5. What are some other beginner-friendly machine learning projects to try?

Rohit Sharma

827 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...

