K-means clustering is one of the techniques data professionals use most often. Because the algorithm is simple and effective, it is applied across many industries.
A data scientist's job involves clustering at many stages. Many large-scale projects rely on clustering algorithms, which has increased the demand for data science professionals.
One of those algorithms is K-means clustering, which is the subject of this article, along with its implementation in MATLAB.
Before getting into the topic, let's have a quick look at what clustering is, why it matters, and how it can be applied in real life. By the end of the post, you will see how useful this algorithm is for understanding large data sets.
What is Clustering?
Data is the most critical component of any application, and a cluster is simply a group of similar data points. As the name suggests, clustering is the process of dividing a large body of data into subgroups, or clusters, based on patterns in the data.
In machine learning, clustering is applied when no predefined labels are available. The aim is to group data into classes with high intra-class similarity and low inter-class similarity.
Clustering is used to explore data. Some real-life examples are market segmentation to find customers with similar behaviours, image segmentation and compression, and grouping documents by topic.
It is often a useful step before building supervised models, because it identifies homogeneous groups in the data. K-means clustering is an unsupervised learning algorithm: it groups similar observations into distinct clusters without using labelled outcomes.
Let's take a look at the K-means algorithm, which is one of the simplest and most widely applied clustering algorithms.
K-means clustering is one of the most popular unsupervised machine learning algorithms.
Unsupervised algorithms draw conclusions from datasets using input vectors alone, without referring to labelled outcomes.
It is an iterative, centroid-based (distance-based) algorithm that partitions the dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as far apart as possible.
In K-means, each cluster is associated with a centroid. The primary aim is to minimise the distance, typically the squared Euclidean distance, between each point and the centroid of its cluster.
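To make this objective concrete, here is a minimal MATLAB sketch that computes the squared Euclidean distance from each point to each centroid, assigns every point to its nearest centroid, and adds up the resulting distances. The points and centroids are made-up values chosen purely for illustration.

```matlab
% Illustrative data (assumed, not from the article): four 2-D points
X = [0 0; 0 2; 10 0; 10 2];
% Two candidate centroids (also assumed)
C = [0 1; 10 1];

K = size(C, 1);
n = size(X, 1);
D = zeros(n, K);                    % D(i,k): squared distance from point i to centroid k
for k = 1:K
    diffs = X - repmat(C(k, :), n, 1);
    D(:, k) = sum(diffs .^ 2, 2);
end

[dmin, idx] = min(D, [], 2);        % nearest centroid for every point
wcss = sum(dmin);                   % within-cluster sum of squared distances

disp(idx');                         % cluster index of each point
disp(wcss);                         % value of the K-means objective
```

The quantity wcss is what K-means tries to make as small as possible by moving the centroids and reassigning the points.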
How Does K-Means Clustering Work?
Because clustering requires several iterations, the K-means algorithm works in a loop. Here is a step-by-step explanation of the way it works:
Step 1: Define the number of clusters, K. If there are 2 clusters, the value of K will be 2.
Step 2: Pick K random data points as the initial centroids, one for each cluster.
Step 3: Repeat Steps 4 to 6 until the assignment of data points to clusters no longer changes.
Step 4: Calculate the squared distance between each data point and each centroid.
Step 5: Allocate each data point to the closest cluster (centroid) to minimise the distance.
Step 6: Recompute each centroid as the average of the data points assigned to its cluster.
Steps 4 to 6 make up a single iteration: the points are assigned to clusters based on their distance from the centroids, and the centroids are then recomputed. Once the centroids, and therefore the assignments, stop changing, the process is stopped.
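The steps above can be sketched in plain MATLAB without any toolbox. This is a minimal illustration, not a production implementation: the toy data, the seed, and the cap of 100 iterations are all assumptions made for the example.

```matlab
rng(1);                                   % fixed seed so the run is repeatable
X = [randn(50, 2); randn(50, 2) + 8];     % assumed toy data: two well-separated blobs
K = 2;                                    % Step 1: number of clusters
n = size(X, 1);

perm = randperm(n);
C = X(perm(1:K), :);                      % Step 2: K random data points as initial centroids
idx = zeros(n, 1);

for iter = 1:100                          % Step 3: iterate until assignments stop changing
    D = zeros(n, K);
    for k = 1:K                           % Step 4: squared distance to every centroid
        D(:, k) = sum((X - repmat(C(k, :), n, 1)) .^ 2, 2);
    end
    [~, newIdx] = min(D, [], 2);          % Step 5: assign each point to the closest centroid
    if isequal(newIdx, idx)
        break;                            % no point changed cluster, so stop
    end
    idx = newIdx;
    for k = 1:K                           % Step 6: centroid = mean of the points in the cluster
        if any(idx == k)
            C(k, :) = mean(X(idx == k, :), 1);
        end
    end
end

disp(C);                                  % final centroids, one row per cluster
```

The check on any(idx == k) simply leaves a centroid where it is if no point is assigned to it, which is one of several possible ways to handle an empty cluster.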
An Illustrative Example Depicting the Implementation of K-Means Clustering
Statement: McDonald's, one of the most famous food chains, wants to open a chain of outlets across California and wants to find the locations that will bring in the maximum revenue.
What Does McDonald's Already Have?
- A strong e-commerce presence
- Online customer data for analysing the locations from which orders are made frequently
Possible challenges they could face
- Analysing the areas from which orders are made frequently.
- Deciding how many outlets should be opened in each area.
- Figuring out the locations for the outlets within each area so that the distance between the store and the delivery points is kept to a minimum.
All these points need a lot of analysis and mathematics to work on.
How Can the K-Means Clustering Method Be Used Here?
With a predefined value of K, the K-means algorithm can be implemented in the following steps:
- Partitioning the order locations into K non-empty subsets.
- Determining the cluster centroid of each subset; each centroid is a candidate outlet location.
- Calculating the distance from each order location to every centroid.
- Assigning each location to the cluster whose centroid (outlet) is nearest.
- After each iteration, once the points have been reassigned, recomputing the centroid of each newly formed cluster, and repeating until the assignments stop changing.
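As a rough sketch of how this could look in MATLAB, the example below clusters hypothetical order locations and treats each centroid as a candidate outlet site. Everything here is an assumption made for illustration: the coordinates are randomly generated around three illustrative hubs, K is set to 3 by hand, and plain Euclidean distance on longitude and latitude is used as a simple approximation. It relies on the kmeans function from the Statistics and Machine Learning Toolbox.

```matlab
rng default;                                   % for reproducibility
% Hypothetical order locations as [longitude, latitude] around three illustrative hubs
hubs   = [-122.4 37.8; -118.2 34.1; -121.5 38.6];
orders = [];
for h = 1:size(hubs, 1)
    orders = [orders; repmat(hubs(h, :), 200, 1) + 0.1 * randn(200, 2)]; %#ok<AGROW>
end

K = 3;                                         % assumed number of outlets
[idx, outlets] = kmeans(orders, K, 'Replicates', 5);

% Each centroid is a candidate outlet location for the orders in its cluster
for k = 1:K
    fprintf('Outlet %d: lon = %.3f, lat = %.3f, orders served = %d\n', ...
            k, outlets(k, 1), outlets(k, 2), sum(idx == k));
end
```

In practice the value of K would itself need to be chosen, for example by comparing the total within-cluster distance for several candidate values.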
Likewise, the K-means clustering algorithm can be applied to a variety of applications at varied scales, including the hospitality industry, crime investigation, and image compression, to name a few.
The K-means algorithm can be implemented in many languages, such as R, Python, and MATLAB. In the next section, we will look at how K-means clustering is applied in MATLAB.
K-Means Algorithm Using MATLAB
K-means is widely used by professionals working in data science, machine learning, artificial intelligence, cryptography, and cybersecurity.
The core objective of the algorithm is to find the centroid of each cluster in heterogeneous data. Here is the MATLAB code for plotting the clusters and their centroids and displaying the coordinates of each centroid:
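rng default; % For reproducibility
% NOTE: the source listing is truncated; lines marked "assumed" are reconstructed, so results may differ from those reported below
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2)]; % assumed second block of sample points
[idx, C] = kmeans(X, 4); % assumed call: 4 clusters, as in the legend
gscatter(X(:,1), X(:,2), idx); hold on; % assumed plot of the points, coloured by cluster
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 15, 'LineWidth', 3); hold off; % assumed centroid markers
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Centroids', 'Location','NW');
title('Cluster Assignments and centroids');
for i=1:size(C, 1)
    disp(['Centroid ', num2str(i), ': X1 = ', num2str(C(i, 1)), '; X2 = ', num2str(C(i, 2))]);
end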
MATLAB Window Showing Four Clusters and Respective Centroids
The centroids obtained are as follows:
- The value of X1 & X2 for Centroid 1: 1.3661; 1.7232
- The value of X1 & X2 for Centroid 2: -1.015; -1.053
- The value of X1 & X2 for Centroid 3: 1.6565; 0.36376
- The value of X1 & X2 for Centroid 4: 0.35134; 0.85358
Some business areas where K-Means clustering can be implemented
K-means clustering is a versatile algorithm and can be used in many business use cases that involve grouping. Some examples are:
Behavioural segmentation:
- Segmenting by purchase history
- Segmenting by activity on an application, website, or platform
- Defining customer personas based on their interests
- Creating profiles based on activity monitoring
Image processing:
- Image compression (for example, using Python)
Sensor measurements:
- Detecting activity types from motion sensors
- Grouping images
- Separating audio
- Identifying groups in health monitoring
Detecting bots or anomalies:
- Separating valid activity groups from bots
- Grouping valid activities to clean up outlier detection
Inventory classification:
- Grouping inventory by sales activity
- Grouping inventory by manufacturing metrics
Advantages of K-Means Clustering
There's a reason why many professionals prefer the K-means clustering algorithm. Some of the benefits it offers:
- It is fast and easy to understand and implement.
- It is relatively efficient: the cost of each iteration grows linearly with the number of data points.
- It gives its best results when the clusters are distinct and well separated from each other. With a large number of variables, K-means is often computationally faster than hierarchical clustering, provided K is small.
- The clusters produced by K-means tend to be tighter than those from hierarchical clustering, especially when the clusters are roughly spherical.
K-means clustering is a widely used approach for analysing data clusters. Once you get the hang of it, it is easy to understand and apply, and it delivers results quickly.
We hope this article has introduced you to this analysis technique. For any queries regarding the K-means algorithm, feel free to comment below.
Further, if this field of study interests you, have a look at our PG Diploma in Machine Learning and AI program which is specially curated for working professionals offering 30+ case studies & assignments, 25+ mentorship sessions from industry experts, 10 Practical Hands-on Capstone Projects, 450+ hours of learning and placement assistance.
What is K Means clustering in machine learning?
K-means is a popular clustering algorithm used in unsupervised machine learning. It starts by choosing K centroids at random. From the next step onwards, the algorithm tries to minimise the overall within-cluster distance, which in turn keeps the clusters well separated. K-means is an iterative approach. In each iteration, it assigns each observation to the closest of the K current centroids and then recomputes the centroids. The centroid of a cluster is defined as the average of all the observations in the cluster.
What are the limitations of the K Means clustering algorithm?
There are some limitations of K-means that you will want to keep in mind when using it. K-means is not robust to outliers. Because each centroid is a mean, data points that lie far away from the rest pull the centroid towards them, which biases the assignment of other data points to clusters. The algorithm works best when the clusters are roughly spherical and similar in size. K-means does not guarantee a unique solution. Because the initial centroids are chosen at random, there is no guarantee that K-means will return the same clusters each time the algorithm is run, and it may converge to a local rather than a global optimum. K-means also requires the number of clusters, K, to be chosen in advance.
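The lack of a unique solution can be seen, and partly worked around, with the kmeans function from MATLAB's Statistics and Machine Learning Toolbox. The sketch below is only an illustration under assumed settings: the toy data, K = 4, and the number of runs are arbitrary choices. It runs the algorithm from several random starts and then uses the 'Replicates' option, which repeats the clustering and keeps the run with the lowest total within-cluster distance.

```matlab
rng default;                                          % for reproducibility
X = [randn(100,2)*0.75 + 1; randn(100,2)*0.5 - 1];    % assumed toy data

% Single random starts: the final objective can vary from run to run
for r = 1:3
    [~, ~, sumd] = kmeans(X, 4, 'Start', 'sample');
    fprintf('Run %d: total within-cluster distance = %.2f\n', r, sum(sumd));
end

% 'Replicates' repeats the clustering and returns the best of the runs
[idx, C, sumd] = kmeans(X, 4, 'Replicates', 10);
fprintf('Best of 10 replicates: %.2f\n', sum(sumd));
```

Using replicates does not guarantee the global optimum, but it makes a poor local solution less likely.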
What are the advantages of K Means clustering?
K-means is effective for data with a single dimension or with many dimensions, not only two or three. It is also useful in situations where there are many clusters. Each cluster is represented by its centroid, which is the mean of the data points in that cluster. Because the algorithm relies on distances, the features are usually standardised first: the mean of each feature is subtracted from each value and the result is divided by the feature's standard deviation, so that all features are on a comparable scale.
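The scaling by mean and standard deviation mentioned above can be sketched in a few lines of plain MATLAB. The data below are invented for illustration (two features on very different scales), and this is only one common way to prepare data before clustering.

```matlab
% Assumed example: two features on very different scales
X = [170 65000; 160 48000; 180 72000; 175 51000];   % e.g. height (cm) and income

n     = size(X, 1);
mu    = mean(X, 1);                 % mean of each feature (column)
sigma = std(X, 0, 1);               % standard deviation of each feature
Z = (X - repmat(mu, n, 1)) ./ repmat(sigma, n, 1);   % standardised data

disp(mean(Z, 1));                   % approximately 0 for each column
disp(std(Z, 0, 1));                 % 1 for each column
```

Without this step, the feature with the larger numeric range would dominate the distance calculation and therefore the clustering.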