Mall Customer Segmentation Project Using R
By Rohit Sharma
Updated on Aug 07, 2025 | 11 min read | 1.37K+ views
Mall customer segmentation in R is a way to understand consumer behavior by grouping customers based on similarities in age, gender, income, and spending habits.
In this project, we will use R and the K-means clustering algorithm to analyze mall customer data and identify distinct customer segments. This method is used by businesses to target their marketing efforts effectively and improve customer satisfaction.
From loading and visualizing the data to scaling and clustering, this step-by-step guide explains how to perform customer segmentation using R in Google Colab.
Read this to improve your R project-building skills: Top 25+ R Projects for Beginners.
This project requires certain skills and a modest time commitment. Both are summarized in the table below, although the time needed will vary with your experience level.
Aspect | Details
Estimated Duration | 60–90 minutes
Difficulty Level | Beginner-friendly
Required Skills | Basic knowledge of R syntax, working with data frames, simple plotting with ggplot2, basic understanding of clustering concepts
To complete this project on mall customer segmentation in R, you'll need a few essential tools and libraries. These tools will help with data manipulation, visualization, clustering, and overall project execution within the Google Colab environment.
Tool/Library | Purpose/Usage
Google Colab | Cloud-based environment to run R code |
R | Programming language used for analysis |
ggplot2 | For creating data visualizations |
dplyr | For efficient data manipulation and cleaning |
cluster | Provides clustering algorithms like K-means |
factoextra | For visualizing the results of clustering analysis |
In this section, we will break down the steps involved in building this project, along with the code and explanation.
To begin this project, you must set up your Google Colab environment to run R instead of Python. This ensures compatibility with R syntax and libraries throughout the analysis.
Steps to switch the runtime to R: open the Runtime menu in Colab, click Change runtime type, select R from the runtime/language dropdown, and save.
This step ensures all necessary R packages are available and loaded into your environment. These libraries will help you handle data, visualize insights, and perform clustering for customer segmentation. Here is the code for this step:
# Install required libraries (only once per session)
install.packages("ggplot2") # Visualization package
install.packages("dplyr") # For data wrangling and manipulation
install.packages("cluster") # For clustering algorithms like k-means
install.packages("factoextra") # For better visualization of clustering results
# Load the libraries into the session
library(ggplot2) # Load visualization tools
library(dplyr) # Load data manipulation functions
library(cluster) # Load clustering algorithms
library(factoextra) # Load cluster visualization tools
The output for the above code confirms that the required packages, along with their dependencies, were installed and loaded for this project.
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
Installing package into ‘/usr/local/lib/R/site-library’
also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘corrplot’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:
    filter, lag

The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
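If you rerun the notebook later, reinstalling every package on each run wastes time. As an optional sketch (not part of the original steps), you can install a package only when it is missing and load it in the same loop:

# Optional: install packages only if they are not already available
pkgs <- c("ggplot2", "dplyr", "cluster", "factoextra")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) { # Check whether the package is installed
    install.packages(pkg)                       # Install it only when missing
  }
  library(pkg, character.only = TRUE)           # Load the package into the session
}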
Also Read: How to Build an Uber Data Analysis Project in R | Trend Analysis Project on COVID-19 using R
In this step, you will load the uploaded dataset into your R environment and preview the first few rows to get an initial understanding of the data. Here’s the code:
# Load your uploaded dataset
mall_data <- read.csv("Mall_Customers.csv") # Read the CSV file into a data frame
# Preview the first few rows
head(mall_data) # Display the top 6 rows to inspect the data structure
The above step gives us a glimpse of the dataset we’re using.
CustomerID | Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
<int> | <chr> | <int> | <int> | <int>
1 | Male | 19 | 15 | 39
2 | Male | 21 | 15 | 81
3 | Female | 20 | 16 | 6
4 | Female | 23 | 16 | 77
5 | Female | 31 | 17 | 40
6 | Female | 22 | 17 | 76
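Before moving on, it can help to confirm that the file was actually found and that all rows were read. This is a small optional check, assuming Mall_Customers.csv sits in the Colab session's working directory:

# Optional check: confirm the file is present and inspect the dataset's size
file.exists("Mall_Customers.csv") # Should return TRUE if the upload succeeded
dim(mall_data)                    # Expected: 200 rows and 5 columns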
This section helps you explore the dataset's structure and basic statistics. It shows the type of each column and summary values like minimum, mean, and maximum. Here’s the code for this step:
# See the structure of the dataset
str(mall_data) # View data types and structure of each column
# Get basic summary statistics
summary(mall_data) # Get min, max, mean, and quartiles for numeric columns
The output for the above code is:
'data.frame': 200 obs. of 5 variables:
 $ CustomerID            : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Gender                : chr "Male" "Male" "Female" "Female" ...
 $ Age                   : int 19 21 20 23 31 22 35 23 64 30 ...
 $ Annual.Income..k..    : int 15 15 16 16 17 17 18 18 19 19 ...
 $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
   CustomerID         Gender               Age        Annual.Income..k..
 Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00
 1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50
 Median :100.50   Mode  :character   Median :36.00   Median : 61.50
 Mean   :100.50                      Mean   :38.85   Mean   : 60.56
 3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00
 Max.   :200.00                      Max.   :70.00   Max.   :137.00
 Spending.Score..1.100.
 Min.   : 1.00
 1st Qu.:34.75
 Median :50.00
 Mean   :50.20
 3rd Qu.:73.00
 Max.   :99.00
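As an optional extra (not part of the original walkthrough), dplyr can summarize the numeric columns by gender, which gives an early hint of whether gender separates spending behavior:

# Optional: average age, income, and spending score by gender
mall_data %>%
  group_by(Gender) %>%                              # Group rows by gender
  summarise(
    count = n(),                                    # Number of customers per gender
    avg_age = mean(Age),                            # Average age
    avg_income = mean(Annual.Income..k..),          # Average annual income (in thousands)
    avg_spending = mean(Spending.Score..1.100.)     # Average spending score
  )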
Here are a few R Projects: Wine Quality Prediction Project in R | Loan Approval Classification Using Logistic Regression in R
This step removes irrelevant columns like CustomerID, which don't contribute to clustering. It also checks for any missing values in the dataset. The code for this step is:
# Remove 'CustomerID' as it's just an identifier
mall_data <- mall_data %>% select(-CustomerID) # Drop the CustomerID column
# Check for any missing values
colSums(is.na(mall_data)) # Sum of NA values in each column
The output for the above code is:
                Gender                    Age     Annual.Income..k.. Spending.Score..1.100.
                     0                      0                      0                      0
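This dataset has no missing values, but if you reuse the workflow on other data, rows with NAs should be handled before clustering. A minimal sketch, assuming you simply want to drop incomplete rows:

# Optional: drop rows containing missing values (has no effect here, since there are none)
mall_data <- na.omit(mall_data)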
This step uses basic plots to explore customer demographics and spending behavior. It helps you identify patterns and relationships in the data before applying clustering. Here’s the code to generate the graphs.
# Gender distribution
ggplot(mall_data, aes(x = Gender)) +
geom_bar(fill = "lightblue") + # Bar chart for gender count
ggtitle("Gender Distribution") # Add a title to the chart
# Age distribution
ggplot(mall_data, aes(x = Age)) +
geom_histogram(fill = "orange", bins = 20) + # Histogram to show age spread
ggtitle("Age Distribution") # Add a title
# Income vs Spending Score
ggplot(mall_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(color = "darkgreen") + # Scatter plot of income vs spending
ggtitle("Income vs Spending Score") # Add a title
The above code produces three graphs, based on gender distribution, age distribution, and income vs spending score.
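If you want one more view of the data (an optional addition, not one of the original three plots), a boxplot of spending score by gender shows whether the two groups spend differently. This works at this stage because Gender is still stored as text:

# Optional: spending score distribution by gender
ggplot(mall_data, aes(x = Gender, y = Spending.Score..1.100.)) +
  geom_boxplot(fill = "lightgreen") + # Compare the spread of spending scores across genders
  ggtitle("Spending Score by Gender") # Add a title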
Clustering algorithms require numeric input. This step converts the Gender column from text to numeric format to make it suitable for clustering. The code is:
# Convert Gender to numeric: Male = 1, Female = 0
mall_data$Gender <- ifelse(mall_data$Gender == "Male", 1, 0) # Replace "Male" with 1 and "Female" with 0
# Check again
str(mall_data) # Confirm the data type of the Gender column is now numeric
The output for this section is:
'data.frame': 200 obs. of 4 variables:
 $ Gender                : num 1 1 0 0 0 0 0 0 1 0 ...
 $ Age                   : int 19 21 20 23 31 22 35 23 64 30 ...
 $ Annual.Income..k..    : int 15 15 16 16 17 17 18 18 19 19 ...
 $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
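As a quick optional sanity check, the mean of the recoded Gender column now equals the proportion of male customers, and a frequency table shows the counts behind it:

# Optional sanity check on the recoding
mean(mall_data$Gender)  # Fraction of customers coded as 1 (Male)
table(mall_data$Gender) # Counts of 0 (Female) and 1 (Male)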
Clustering is sensitive to differences in scale. This step standardizes the dataset so that each feature contributes equally to the distance calculations. The code to normalize the data is:
# Normalize the dataset
mall_scaled <- scale(mall_data) # Standardize all numeric features
# Check scaled data
head(mall_scaled) # Preview the first few rows of the normalized dataset
The output for this step is given below:
Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
1.1253282 | -1.4210029 | -1.734646 | -0.4337131
1.1253282 | -1.2778288 | -1.734646 | 1.1927111
-0.8841865 | -1.3494159 | -1.696572 | -1.7116178
-0.8841865 | -1.1346547 | -1.696572 | 1.0378135
-0.8841865 | -0.5619583 | -1.658498 | -0.3949887
-0.8841865 | -1.2062418 | -1.658498 | 0.9990891
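To see what scale() actually did, you can optionally confirm that every column now has a mean of about 0 and a standard deviation of 1:

# Optional: verify the standardization
round(colMeans(mall_scaled), 3) # Column means should be approximately 0
apply(mall_scaled, 2, sd)       # Column standard deviations should be 1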
Here are some R projects: Car Data Analysis Project Using R | Player Performance Analysis & Prediction Using R
Before applying K-means clustering, it’s important to decide how many clusters (k) to create. The Elbow Method helps identify the ideal number by plotting the within-cluster sum of squares (WSS) for different values of k. Here’s the code:
# Elbow method to choose best k
fviz_nbclust(mall_scaled, kmeans, method = "wss") + # Plot WSS for each k
geom_vline(xintercept = 5, linetype = 2) + # Add a vertical line at the optimal k (e.g., 5)
ggtitle("Elbow Method to Find Optimal k") # Add a plot title
The above code gives us a graph:
The above graph shows that the within-cluster sum of squares drops sharply up to k = 5 and then levels off, so 5 (marked by the dashed vertical line) is a reasonable choice for the number of clusters.
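Because the elbow can sometimes be ambiguous, an optional cross-check (not part of the original steps) is the average silhouette method, which the same fviz_nbclust() function supports:

# Optional: average silhouette width as a second opinion on k
fviz_nbclust(mall_scaled, kmeans, method = "silhouette") + # Higher average silhouette = better-separated clusters
  ggtitle("Silhouette Method to Find Optimal k")           # Add a plot title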
Once the optimal number of clusters is known, K-means is applied to group similar customers. This step creates cluster labels and attaches them to the original dataset. Here’s the code:
# Run k-means with 5 clusters
set.seed(123) # Set seed for reproducibility
kmeans_model <- kmeans(mall_scaled, centers = 5, nstart = 25) # Apply K-means with 5 clusters
# Add cluster assignments to original data
mall_data$Cluster <- as.factor(kmeans_model$cluster) # Store cluster number as a new column
# View with cluster info
head(mall_data) # See the first few rows including the new Cluster column
The output for the above code gives us a table:
Gender | Age | Annual.Income..k.. | Spending.Score..1.100. | Cluster
<dbl> | <int> | <int> | <int> | <fct>
1 | 19 | 15 | 39 | 2
1 | 21 | 15 | 81 | 2
0 | 20 | 16 | 6 | 5
0 | 23 | 16 | 77 | 1
0 | 31 | 17 | 40 | 5
0 | 22 | 17 | 76 | 1
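It is also worth glancing at the fitted model object itself. As an optional step, the size of each cluster and the cluster centers (in the scaled feature space) are stored on the model:

# Optional: inspect the fitted k-means model
kmeans_model$size    # Number of customers assigned to each of the 5 clusters
kmeans_model$centers # Cluster centers expressed in scaled units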
This step generates a cluster plot to visually understand how the customers are grouped. Each point represents a customer, and colors indicate their assigned cluster. The code is:
# Plot clusters
fviz_cluster(kmeans_model, data = mall_scaled,
ellipse.type = "convex", # Draw convex shapes around clusters
palette = "jco", # Use a predefined color palette
ggtheme = theme_minimal()) # Apply a clean minimal theme
The above code creates a cluster plot:
The above cluster graph plots the customers on the first two principal components, with a convex hull drawn around each of the five segments, so you can see how clearly the groups separate and where they overlap.
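Since income and spending score drive much of the separation, a plain scatter plot of those two raw features colored by cluster is another easy way to see the segments. This is an optional sketch that reuses the Cluster column added earlier:

# Optional: income vs spending score, colored by assigned cluster
ggplot(mall_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = Cluster)) +
  geom_point(size = 2) +                                    # One point per customer, colored by cluster
  ggtitle("Customer Segments by Income and Spending Score") # Add a title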
Must Read: 18 Types of Regression in Machine Learning You Should Know
After forming clusters, this step helps summarize the average attributes of customers within each group. It gives insights into how different customer segments behave based on features like age, income, and spending. The code is:
# Summarize each cluster
mall_data %>%
group_by(Cluster) %>% # Group data by cluster
summarise_all(mean) # Calculate mean of each feature within each cluster
The output for this step is:
Cluster | Gender | Age | Annual.Income..k.. | Spending.Score..1.100.
<fct> | <dbl> | <dbl> | <dbl> | <dbl>
1 | 0.0000000 | 28.39286 | 60.42857 | 68.17857
2 | 1.0000000 | 28.25000 | 62.00000 | 71.67500
3 | 1.0000000 | 55.90323 | 48.77419 | 38.80645
4 | 0.5483871 | 40.41935 | 90.00000 | 15.74194
5 | 0.0000000 | 49.14286 | 46.33333 | 39.61905
The above table shows that Clusters 1 and 2 are young customers (around 28 years old) with moderate incomes and high spending scores, female and male respectively; Clusters 3 and 5 are older customers with average incomes and moderate spending; and Cluster 4 is a mixed-gender group with the highest incomes (about 90k) but very low spending scores.
In this Mall Customer Segmentation project, we used the K-means clustering algorithm in R on Google Colab to segment customers based on their gender, age, annual income, and spending score.
After preprocessing the dataset, we visualized patterns using scatter plots and applied the Elbow Method to determine the optimal number of clusters, which was found to be 5. The K-means model successfully grouped customers into distinct segments. These segments can help businesses tailor marketing strategies, personalize offers, and improve customer targeting based on spending behavior and income levels.
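If you want to reuse these segments outside the notebook, an optional final step is to export the labeled data to a CSV file (the file name below is just an example):

# Optional: save the segmented data for later use
write.csv(mall_data, "Mall_Customers_Segmented.csv", row.names = FALSE) # Export data including the Cluster column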
Colab Link:
https://colab.research.google.com/drive/1IpC3qGhGDMixBdNNLtZpQbuH0HCOQNsW#scrollTo=V7CT2Q_9Zkc1