K-means clustering is one of the most commonly used techniques among data professionals. Because the algorithm is simple and effective, it is applied across many industries.
Clustering comes up at many stages of a data scientist's work, and many large-scale projects rely on clustering algorithms, which has raised the demand for data science professionals.
One of those algorithms is K-means clustering, which is the subject of this article, along with its implementation in MATLAB.
Before getting into the topic, let's have a quick look at what clustering is, why it matters, and how it can be applied in real life. By the end of the post, you will see how useful this algorithm is for understanding large data sets.
What is Clustering?
Data is the most critical component of any application, and a cluster is simply a collection of similar data points grouped together. As the name suggests, clustering is the process of dividing a large body of data into subgroups, or clusters, based on patterns in the data.
In machine learning, clustering is applied when no predefined labels are available. The aim is to group data into classes with high intra-class similarity and low inter-class similarity.
Clustering is used to explore data. Real-life examples include market segmentation to find customers with similar behaviours, image segmentation and compression, and clustering documents by topic.
It is often a useful step before building supervised models, because it identifies homogeneous groups in the data. K-means clustering is an unsupervised learning algorithm: it groups similar observations into distinct clusters without being told the groups in advance.
Let's take a look at the K-means algorithm, which is one of the simplest and most widely applied clustering algorithms.
K-means clustering is one of the most popular unsupervised machine learning algorithms.
Unsupervised algorithms draw inferences from datasets using input vectors alone, without referring to labelled outcomes.
It is an iterative, centroid-based (distance-based) algorithm that partitions the dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It tries to make the points within a cluster as similar as possible while keeping the clusters as far apart as possible.
In K-means, each cluster is represented by a centroid, the mean of the points assigned to it. The primary aim is to minimise the distance, typically the squared Euclidean distance, between each point and the centroid of its cluster.
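As a rough illustration of that objective, the snippet below computes the within-cluster sum of squared Euclidean distances for a tiny hand-made dataset. The data, the cluster labels, and the variable names (X, idx, C, wcss) are assumptions made for illustration and do not come from this article. It uses only base MATLAB and assumes R2016b or later for implicit expansion.

```matlab
% Toy data: four 2-D points, assumed for illustration only
X   = [0 0; 0 2; 10 0; 10 2];
idx = [1; 1; 2; 2];              % assumed cluster label for each point
K   = 2;

C    = zeros(K, size(X, 2));     % one centroid per row
wcss = 0;                        % within-cluster sum of squares
for k = 1:K
    members = X(idx == k, :);            % points assigned to cluster k
    C(k, :) = mean(members, 1);          % centroid = mean of its members
    diffs   = members - C(k, :);         % offsets from the centroid
    wcss    = wcss + sum(diffs(:).^2);   % add the squared Euclidean distances
end
disp(C);      % centroids: [0 1; 10 1]
disp(wcss);   % 4, since every point lies 1 unit from its centroid
```

A lower value of wcss for the same K means tighter clusters, which is the quantity K-means tries to reduce.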
How Does K-Means Clustering Work?
K-means is an iterative procedure. Here is a step-by-step explanation of the way it works:
Step 1: Define the number of clusters ‘K’. If there are 2 clusters, the value of ‘K’ will be 2.
Step 2: Initialise the centroids by picking K data points at random, one for each cluster.
Step 3: Calculate the squared distance between each data point and every centroid.
Step 4: Allocate each data point to the closest centroid (cluster).
Step 5: Recompute each centroid as the average of the data points assigned to its cluster.
Step 6: Repeat Steps 3 to 5 until the assignment of data points to clusters no longer changes.
Steps 3 to 5 make up a single iteration: assign the points to clusters based on their distance from the centroids, then recompute the centroids. Once the assignments, and therefore the centroids, stop changing, the process stops.
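The iteration described above can be sketched in a few lines of base MATLAB. This is a minimal sketch under stated assumptions, not a replacement for a library routine: the toy data is made up, and the starting centroids are fixed to the first and last rows rather than chosen at random so that the run is reproducible. It assumes R2016b or later for implicit expansion.

```matlab
% Toy data (assumed for illustration): two obvious groups in 2-D
X = [1 1; 1.5 2; 1 0.5; 8 8; 9 9.5; 8.5 9];
K = 2;

% Simplification: start from the first and last rows instead of a random
% pick, so the result is reproducible (only valid here because K = 2)
C   = X([1 end], :);
idx = zeros(size(X, 1), 1);

for iter = 1:100                          % safety cap on iterations
    % Assignment step: squared Euclidean distance to every centroid
    D = zeros(size(X, 1), K);
    for k = 1:K
        D(:, k) = sum((X - C(k, :)).^2, 2);
    end
    [~, newIdx] = min(D, [], 2);

    if isequal(newIdx, idx)               % assignments stable: converged
        break;
    end
    idx = newIdx;

    % Update step: move each centroid to the mean of its points
    for k = 1:K
        if any(idx == k)                  % leave an empty cluster's centroid in place
            C(k, :) = mean(X(idx == k, :), 1);
        end
    end
end
disp(idx');   % cluster label of each point
disp(C);      % final centroids, one per row
```

On this data the first three points end up in one cluster and the last three in the other. With random initialisation the labels 1 and 2 could be swapped, which is why real runs usually fix a random seed.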
An Illustrative Example Depicting the Implementation of K-Means Clustering
Statement: McDonald’s, one of the best-known food chains, wants to open a chain of outlets across California and wants to find the locations that will bring in the most revenue.
What Does McDonald’s Already Have?
- A strong e-commerce presence
- Online customer data for analysing the locations from which orders are placed most frequently
Possible challenges they could face
- Analysing the areas from which orders are placed most frequently.
- Working out how many outlets to open in each area.
- Choosing outlet locations within each area so that the distance between the store and the delivery points is kept to a minimum.
All of these points require a good deal of analysis and mathematics.
How Can the K-Means Clustering Method Be Used Here?
With a predefined value of K, the K-means algorithm can be applied in the following steps:
- Partition the order locations into K non-empty subsets.
- Determine the centroid of each subset; these are the candidate outlet locations.
- Calculate the distance from each order location to every centroid.
- Assign each location to the cluster whose centroid (outlet) is closest.
- After each round of reassignment, recompute the centroid of each newly formed cluster, and repeat until the assignments stop changing.
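To make one pass of these steps concrete, here is a small sketch in base MATLAB. Everything in it is hypothetical: the order coordinates, the two starting outlet positions, and the variable names are invented for illustration and are not real McDonald's data. It assumes R2016b or later for implicit expansion.

```matlab
% Hypothetical order locations (x, y in km on a local grid); not real data
orders  = [2 3; 3 3; 2 4; 10 10; 11 9; 10 11];
% Hypothetical starting guesses for K = 2 outlet positions
outlets = [0 0; 12 12];

K = size(outlets, 1);
D = zeros(size(orders, 1), K);
for k = 1:K
    % straight-line distance from every order to outlet k
    D(:, k) = sqrt(sum((orders - outlets(k, :)).^2, 2));
end
[nearestDist, assigned] = min(D, [], 2);   % closest outlet for every order

% One update: move each outlet to the centre of the orders it serves
for k = 1:K
    outlets(k, :) = mean(orders(assigned == k, :), 1);
end
disp(assigned');           % which outlet serves each order
disp(outlets);             % outlet positions after one update
disp(mean(nearestDist));   % average distance before the update
```

Repeating the assignment and update until nothing changes gives the outlet positions that K-means would propose for this K.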
Likewise, the K-means clustering algorithm can be applied to a wide variety of problems at different scales, in the hospitality industry, crime investigation departments, and image resizing, to name a few.
The K-means algorithm is implemented in many languages, such as R, Python, and MATLAB. In the next section, we will look at how K-means clustering is applied in MATLAB.
K-Means Algorithm Using MATLAB
K-means is widely used by professionals working in data science, machine learning, artificial intelligence, cryptography, and cybersecurity.
The core objective of the algorithm is to find the centroid of each cluster in data that is typically heterogeneous. Here is MATLAB code for plotting the clusters with their centroids and printing the coordinates of each centroid:
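rng default; % For reproducibility
% NOTE: the source listing is truncated after the first row of X; the second row, the kmeans call, and the plotting lines are an assumed reconstruction
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2)];
[idx, C] = kmeans(X, 4); % kmeans and gscatter need the Statistics and Machine Learning Toolbox
figure; gscatter(X(:,1), X(:,2), idx); hold on; % one colour per cluster
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 15, 'LineWidth', 3); % mark the centroids
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Centroids','Location','NW');
title('Cluster Assignments and centroids');
for i = 1:size(C, 1)
    disp(['Centroid ', num2str(i), ': X1 = ', num2str(C(i, 1)), '; X2 = ', num2str(C(i, 2))]);
end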
MATLAB Window Showing Four Clusters and Respective Centroids
The centroids obtained are as follows:
- The value of X1 & X2 for Centroid 1: 1.3661; 1.7232
- The value of X1 & X2 for Centroid 2: -1.015; -1.053
- The value of X1 & X2 for Centroid 3: 1.6565; 0.36376
- The value of X1 & X2 for Centroid 4: 0.35134; 0.85358
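Once the centroids are known, a new observation can be assigned to a cluster by finding the nearest centroid. The sketch below uses the four centroid values listed above; the new point (1.5, 0.5) and the variable names are hypothetical, chosen only for illustration. It uses base MATLAB and assumes R2016b or later for implicit expansion.

```matlab
% Centroids as reported above (rows = centroids 1..4, columns = X1, X2)
C = [ 1.3661   1.7232;
     -1.015   -1.053;
      1.6565   0.36376;
      0.35134  0.85358];

newPoint = [1.5 0.5];               % hypothetical new observation

d2 = sum((C - newPoint).^2, 2);     % squared Euclidean distance to each centroid
[~, nearest] = min(d2);             % index of the closest centroid
disp(nearest);                      % 3: the point is closest to Centroid 3
```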
Some business areas where K-Means clustering can be implemented
K-means clustering is a versatile algorithm and can be used for many business use cases for any type of grouping. Some examples are:
Behavioural segmentation:
- Segment by purchase history
- Segment by activity on an application, website, or platform
- Define customer personas based on interests
- Create profiles based on activity monitoring
Image scaling:
- Image compression (for example, using Python)
Sensor measurements:
- Detect activity types from motion sensors
- Group images
- Separate audio
- Identify groups in health monitoring data
Detecting bots or anomalies:
- Separate valid activity groups from bots
- Group valid activity to clean up outlier detection
Inventory classification:
- Group inventory by sales activity
- Group inventory by manufacturing metrics
Advantages of K-Means Clustering
There are good reasons why professionals favour the K-means clustering algorithm. Some of the benefits it offers:
- It is fast, simple, and easy to understand.
- It is relatively efficient computationally.
- It gives its best results when the clusters in the data are distinct and well separated. With a large number of variables, K-means is often computationally faster than hierarchical clustering, provided K is small.
- The clusters produced by K-means are often tighter than those from some other methods, such as hierarchical clustering, particularly when the clusters are roughly spherical.
K-means clustering is a widely used approach for finding clusters in data. Once you are comfortable with it, it is easy to apply and delivers results quickly.
We hope this article has introduced you to this analysis technique.