Mall Customer Segmentation Project Using R
By Rohit Sharma
Updated on Aug 07, 2025 | 11 min read | 1.37K+ views
Mall customer segmentation in R is a way to understand consumer behavior by grouping customers based on similarities in age, gender, income, and spending habits.
In this project, we will use R and the K-means clustering algorithm to analyze mall customer data and identify distinct customer segments. This method is used by businesses to target their marketing efforts effectively and improve customer satisfaction.
From loading and visualizing the data to scaling and clustering, this step-by-step guide explains how to perform customer segmentation using R in Google Colab.
Read this to improve your R project-building skills: Top 25+ R Projects for Beginners.
This project requires certain skills and a modest time commitment. Both are summarized in the table below, although the time needed will vary with your experience level.
Aspect | Details
Estimated Duration | 60–90 minutes
Difficulty Level | Beginner-friendly
Required Skills | Basic knowledge of R syntax, working with data frames, simple plotting with ggplot2, basic understanding of clustering concepts
To complete this project on mall customer segmentation in R, you'll need a few essential tools and libraries. These tools will help with data manipulation, visualization, clustering, and overall project execution within the Google Colab environment.
Tool/Library | Purpose/Usage
Google Colab | Cloud-based environment to run R code |
R | Programming language used for analysis |
ggplot2 | For creating data visualizations |
dplyr | For efficient data manipulation and cleaning |
cluster | Provides clustering algorithms like K-means |
factoextra | For visualizing the results of clustering analysis |
In this section, we will break down the steps involved in building this project, along with the code and explanation.
To begin this project, you must set up your Google Colab environment to run R instead of Python. This ensures compatibility with R syntax and libraries throughout the analysis.
Steps to switch the runtime to R: open the Runtime menu in Colab, click Change runtime type, select R from the runtime/language dropdown, and save.
This step ensures all necessary R packages are available and loaded into your environment. These libraries will help you handle data, visualize insights, and perform clustering for customer segmentation. Here is the code for this step:
# Install required libraries (only once per session)
install.packages("ggplot2") # Visualization package
install.packages("dplyr") # For data wrangling and manipulation
install.packages("cluster") # For clustering algorithms like k-means
install.packages("factoextra") # For better visualization of clustering results
# Load the libraries into the session
library(ggplot2) # Load visualization tools
library(dplyr) # Load data manipulation functions
library(cluster) # Load clustering algorithms
library(factoextra) # Load cluster visualization tools
The output for the above code confirms that the required packages, along with their dependencies, were installed and loaded for this project.
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘corrplot’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:
    filter, lag

The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
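If you rerun the notebook later, reinstalling every package on each run wastes time. As an optional sketch (not part of the original steps), you can install a package only when it is missing and load it in the same loop:

# Optional: install packages only if they are not already available
pkgs <- c("ggplot2", "dplyr", "cluster", "factoextra")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) { # Check whether the package is installed
    install.packages(pkg)                       # Install it only when missing
  }
  library(pkg, character.only = TRUE)           # Load the package into the session
}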
Also Read: How to Build an Uber Data Analysis Project in R | Trend Analysis Project on COVID-19 using R
In this step, you will load the uploaded dataset into your R environment and preview the first few rows to get an initial understanding of the data. Here’s the code:
# Load your uploaded dataset
mall_data <- read.csv("Mall_Customers.csv") # Read the CSV file into a data frame
# Preview the first few rows
head(mall_data) # Display the top 6 rows to inspect the data structure
The above step gives us a glimpse of the dataset we’re using.
CustomerID | Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
<int> | <chr> | <int> | <int> | <int>
1 | Male | 19 | 15 | 39
2 | Male | 21 | 15 | 81
3 | Female | 20 | 16 | 6
4 | Female | 23 | 16 | 77
5 | Female | 31 | 17 | 40
6 | Female | 22 | 17 | 76
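Before moving on, it can help to confirm that the file was actually found and that all rows were read. This is a small optional check, assuming Mall_Customers.csv sits in the Colab session's working directory:

# Optional check: confirm the file is present and inspect the dataset's size
file.exists("Mall_Customers.csv") # Should return TRUE if the upload succeeded
dim(mall_data)                    # Expected: 200 rows and 5 columns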
This section helps you explore the dataset's structure and basic statistics. It shows the type of each column and summary values like minimum, mean, and maximum. Here’s the code for this step:
# See the structure of the dataset
str(mall_data) # View data types and structure of each column
# Get basic summary statistics
summary(mall_data) # Get min, max, mean, and quartiles for numeric columns
The output for the above code is:
'data.frame': 200 obs. of 5 variables:
 $ CustomerID            : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Gender                : chr "Male" "Male" "Female" "Female" ...
 $ Age                   : int 19 21 20 23 31 22 35 23 64 30 ...
 $ Annual.Income..k..    : int 15 15 16 16 17 17 18 18 19 19 ...
 $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
   CustomerID         Gender               Age        Annual.Income..k..
 Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00
 1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50
 Median :100.50   Mode  :character   Median :36.00   Median : 61.50
 Mean   :100.50                      Mean   :38.85   Mean   : 60.56
 3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00
 Max.   :200.00                      Max.   :70.00   Max.   :137.00
 Spending.Score..1.100.
 Min.   : 1.00
 1st Qu.:34.75
 Median :50.00
 Mean   :50.20
 3rd Qu.:73.00
 Max.   :99.00
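As an optional extra (not part of the original walkthrough), dplyr can summarize the numeric columns by gender, which gives an early hint of whether gender separates spending behavior:

# Optional: average age, income, and spending score by gender
mall_data %>%
  group_by(Gender) %>%                              # Group rows by gender
  summarise(
    count = n(),                                    # Number of customers per gender
    avg_age = mean(Age),                            # Average age
    avg_income = mean(Annual.Income..k..),          # Average annual income (in thousands)
    avg_spending = mean(Spending.Score..1.100.)     # Average spending score
  )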
Here are a few R Projects: Wine Quality Prediction Project in R | Loan Approval Classification Using Logistic Regression in R
This step removes irrelevant columns like CustomerID, which don't contribute to clustering. It also checks for any missing values in the dataset. The code for this step is:
# Remove 'CustomerID' as it's just an identifier
mall_data <- mall_data %>% select(-CustomerID) # Drop the CustomerID column
# Check for any missing values
colSums(is.na(mall_data)) # Sum of NA values in each column
The output for the above code is:
                Gender                    Age     Annual.Income..k.. Spending.Score..1.100.
                     0                      0                      0                      0
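This dataset has no missing values, but if you reuse the workflow on other data, rows with NAs should be handled before clustering. A minimal sketch, assuming you simply want to drop incomplete rows:

# Optional: drop rows containing missing values (has no effect here, since there are none)
mall_data <- na.omit(mall_data)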
This step uses basic plots to explore customer demographics and spending behavior. It helps you identify patterns and relationships in the data before applying clustering. Here’s the code to generate the graphs.
# Gender distribution
ggplot(mall_data, aes(x = Gender)) +
geom_bar(fill = "lightblue") + # Bar chart for gender count
ggtitle("Gender Distribution") # Add a title to the chart
# Age distribution
ggplot(mall_data, aes(x = Age)) +
geom_histogram(fill = "orange", bins = 20) + # Histogram to show age spread
ggtitle("Age Distribution") # Add a title
# Income vs Spending Score
ggplot(mall_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(color = "darkgreen") + # Scatter plot of income vs spending
ggtitle("Income vs Spending Score") # Add a title
The above code produces three graphs, based on gender distribution, age distribution, and income vs spending score.
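If you want one more view of the data (an optional addition, not one of the original three plots), a boxplot of spending score by gender shows whether the two groups spend differently. This works at this stage because Gender is still stored as text:

# Optional: spending score distribution by gender
ggplot(mall_data, aes(x = Gender, y = Spending.Score..1.100.)) +
  geom_boxplot(fill = "lightgreen") + # Compare the spread of spending scores across genders
  ggtitle("Spending Score by Gender") # Add a title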
Clustering algorithms require numeric input. This step converts the Gender column from text to numeric format to make it suitable for clustering. The code is:
# Convert Gender to numeric: Male = 1, Female = 0
mall_data$Gender <- ifelse(mall_data$Gender == "Male", 1, 0) # Replace "Male" with 1 and "Female" with 0
# Check again
str(mall_data) # Confirm the data type of the Gender column is now numeric
The output for this section is:
'data.frame': 200 obs. of 4 variables:
 $ Gender                : num 1 1 0 0 0 0 0 0 1 0 ...
 $ Age                   : int 19 21 20 23 31 22 35 23 64 30 ...
 $ Annual.Income..k..    : int 15 15 16 16 17 17 18 18 19 19 ...
 $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
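As a quick optional sanity check, the mean of the recoded Gender column now equals the proportion of male customers, and a frequency table shows the counts behind it:

# Optional sanity check on the recoding
mean(mall_data$Gender)  # Fraction of customers coded as 1 (Male)
table(mall_data$Gender) # Counts of 0 (Female) and 1 (Male)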
Clustering is sensitive to differences in scale. This step standardizes the dataset so that each feature contributes equally to the distance calculations. The code to normalize the data is:
# Normalize the dataset
mall_scaled <- scale(mall_data) # Standardize all numeric features
# Check scaled data
head(mall_scaled) # Preview the first few rows of the normalized dataset
The output for this step is given below:
Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
1.1253282 | -1.4210029 | -1.734646 | -0.4337131
1.1253282 | -1.2778288 | -1.734646 | 1.1927111
-0.8841865 | -1.3494159 | -1.696572 | -1.7116178
-0.8841865 | -1.1346547 | -1.696572 | 1.0378135
-0.8841865 | -0.5619583 | -1.658498 | -0.3949887
-0.8841865 | -1.2062418 | -1.658498 | 0.9990891
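To see what scale() actually did, you can optionally confirm that every column now has a mean of about 0 and a standard deviation of 1:

# Optional: verify the standardization
round(colMeans(mall_scaled), 3) # Column means should be approximately 0
apply(mall_scaled, 2, sd)       # Column standard deviations should be 1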
Here are some R projects: Car Data Analysis Project Using R | Player Performance Analysis & Prediction Using R
Before applying K-means clustering, it’s important to decide how many clusters (k) to create. The Elbow Method helps identify the ideal number by plotting the within-cluster sum of squares (WSS) for different values of k. Here’s the code:
# Elbow method to choose best k
fviz_nbclust(mall_scaled, kmeans, method = "wss") + # Plot WSS for each k
geom_vline(xintercept = 5, linetype = 2) + # Add a vertical line at the optimal k (e.g., 5)
ggtitle("Elbow Method to Find Optimal k") # Add a plot title
The above code gives us a graph:
The above graph shows that the within-cluster sum of squares drops sharply up to k = 5 and then levels off, so 5 (marked by the dashed vertical line) is a reasonable choice for the number of clusters.
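Because the elbow can sometimes be ambiguous, an optional cross-check (not part of the original steps) is the average silhouette method, which the same fviz_nbclust() function supports:

# Optional: average silhouette width as a second opinion on k
fviz_nbclust(mall_scaled, kmeans, method = "silhouette") + # Higher average silhouette = better-separated clusters
  ggtitle("Silhouette Method to Find Optimal k")           # Add a plot title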
Once the optimal number of clusters is known, K-means is applied to group similar customers. This step creates cluster labels and attaches them to the original dataset. Here’s the code:
# Run k-means with 5 clusters
set.seed(123) # Set seed for reproducibility
kmeans_model <- kmeans(mall_scaled, centers = 5, nstart = 25) # Apply K-means with 5 clusters
# Add cluster assignments to original data
mall_data$Cluster <- as.factor(kmeans_model$cluster) # Store cluster number as a new column
# View with cluster info
head(mall_data) # See the first few rows including the new Cluster column
The output for the above code gives us a table:
Gender | Age | Annual.Income..k.. | Spending.Score..1.100. | Cluster
<dbl> | <int> | <int> | <int> | <fct>
1 | 19 | 15 | 39 | 2
1 | 21 | 15 | 81 | 2
0 | 20 | 16 | 6 | 5
0 | 23 | 16 | 77 | 1
0 | 31 | 17 | 40 | 5
0 | 22 | 17 | 76 | 1
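It is also worth glancing at the fitted model object itself. As an optional step, the size of each cluster and the cluster centers (in the scaled feature space) are stored on the model:

# Optional: inspect the fitted k-means model
kmeans_model$size    # Number of customers assigned to each of the 5 clusters
kmeans_model$centers # Cluster centers expressed in scaled units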
This step generates a cluster plot to visually understand how the customers are grouped. Each point represents a customer, and colors indicate their assigned cluster. The code is:
# Plot clusters
fviz_cluster(kmeans_model, data = mall_scaled,
ellipse.type = "convex", # Draw convex shapes around clusters
palette = "jco", # Use a predefined color palette
ggtheme = theme_minimal()) # Apply a clean minimal theme
The above code creates a cluster plot:
The above cluster graph plots the customers on the first two principal components, with a convex hull drawn around each of the five segments, so you can see how clearly the groups separate and where they overlap.
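Since income and spending score drive much of the separation, a plain scatter plot of those two raw features colored by cluster is another easy way to see the segments. This is an optional sketch that reuses the Cluster column added earlier:

# Optional: income vs spending score, colored by assigned cluster
ggplot(mall_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = Cluster)) +
  geom_point(size = 2) +                                    # One point per customer, colored by cluster
  ggtitle("Customer Segments by Income and Spending Score") # Add a title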
Must Read: 18 Types of Regression in Machine Learning You Should Know
After forming clusters, this step helps summarize the average attributes of customers within each group. It gives insights into how different customer segments behave based on features like age, income, and spending. The code is:
# Summarize each cluster
mall_data %>%
group_by(Cluster) %>% # Group data by cluster
summarise_all(mean) # Calculate mean of each feature within each cluster
The output for this step is:
Cluster | Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
<fct> | <dbl> | <dbl> | <dbl> | <dbl>
1 | 0.0000000 | 28.39286 | 60.42857 | 68.17857
2 | 1.0000000 | 28.25000 | 62.00000 | 71.67500
3 | 1.0000000 | 55.90323 | 48.77419 | 38.80645
4 | 0.5483871 | 40.41935 | 90.00000 | 15.74194
5 | 0.0000000 | 49.14286 | 46.33333 | 39.61905
The above table shows that Clusters 1 and 2 are young customers (around 28 years old) with moderate incomes and high spending scores, female and male respectively; Clusters 3 and 5 are older customers with average incomes and moderate spending; and Cluster 4 is a mixed-gender group with the highest incomes (about 90k) but very low spending scores.
In this Mall Customer Segmentation project, we used the K-means clustering algorithm in R on Google Colab to segment customers based on their gender, age, annual income, and spending score.
After preprocessing the dataset, we visualized patterns using scatter plots and applied the Elbow Method to determine the optimal number of clusters, which was found to be 5. The K-means model successfully grouped customers into distinct segments. These segments can help businesses tailor marketing strategies, personalize offers, and improve customer targeting based on spending behavior and income levels.
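If you want to reuse these segments outside the notebook, an optional final step is to export the labeled data to a CSV file (the file name below is just an example):

# Optional: save the segmented data for later use
write.csv(mall_data, "Mall_Customers_Segmented.csv", row.names = FALSE) # Export data including the Cluster column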
Colab Link:
https://colab.research.google.com/drive/1IpC3qGhGDMixBdNNLtZpQbuH0HCOQNsW#scrollTo=V7CT2Q_9Zkc1