
    What is Centroid Based Clustering? Implementation, Variations & Applications

    By Mukesh Kumar

    Updated on May 09, 2025 | 28 min read | 1.5k views


    Did you know? The term "K-Means" was first coined by James MacQueen in 1967, but the underlying idea dates back to Hugo Steinhaus in 1956. The standard algorithm was proposed independently by Stuart Lloyd in 1957 and Edward Forgy in 1965, which is why it is sometimes called the Lloyd-Forgy algorithm. 

    Centroid-based clustering is a method where data points are grouped based on their similarity to a central point, called the centroid. The problem? Choosing the right technique and understanding its variations can be tricky. 

    In this tutorial, you’ll learn how centroid based clustering works, explore its different forms, and discover how it applies to real-life problems. 

    Improve your machine learning skills with upGrad’s online AI and ML courses. Specialize in cybersecurity, full-stack development, game development, and much more. Take the next step in your learning journey! 

    What is Centroid Based Clustering? Key Concepts and Types

    Clustering is a fundamental technique in unsupervised machine learning where data points are grouped based on their similarities. The objective is to identify inherent patterns or structures within the data without predefined labels.

    Working with centroid-based clustering goes beyond simply applying the algorithm. To make the most of it, it's essential to focus on data preprocessing, choosing an appropriate number of clusters, and accurately interpreting the clustering results.

    In centroid-based clustering, each cluster is represented by a central point known as the centroid, which acts as the "average" of all data points in that cluster. This approach works well for partitioning data into distinct groups where each cluster can be described by its central point, simplifying the analysis and interpretation of complex datasets.

    There are two main types of clustering:

    • Hierarchical Clustering: Builds a tree-like structure, called a dendrogram, where data points are grouped progressively. It can be:
      • Agglomerative (bottom-up approach): Starts with individual points and merges clusters.
      • Divisive (top-down approach): Starts with all points in one cluster and divides them into smaller clusters.
    • Partitioning/Centroid-Based Clustering: Divides the data into a predefined number of clusters. The most common method is K-Means, where clusters are formed around centroids. Other variants include:
      • K-Medoids: Uses actual data points as cluster centers.
      • Mini-Batch K-Means: Uses small random samples for faster clustering on large datasets.

    The mathematical foundation behind centroid based clustering in data mining is key to its simplicity and effectiveness. By minimizing the distance between data points and their respective centroids, it ensures that the clusters are as compact and well-separated as possible. 

    Also Read: What is Cluster Analysis in Data Mining? Methods, Benefits, and More

    Mathematical Foundation of Clustering


    At its core, the mathematical foundation of clustering focuses on how data points are grouped based on their similarities, often by minimizing a specific distance measure. Understanding this foundation helps you grasp how algorithms like K-Means and K-Medoids work to define clusters with precision.

    1. Distance Metrics

    Distance metrics are the backbone of clustering algorithms, as they define how "similar" or "distant" two data points are from each other. Different clustering algorithms rely on various distance measures to group data points.

    • Euclidean Distance: The most commonly used metric, especially in algorithms like K-Means. It measures the straight-line distance between two points in a Euclidean space.
      • Formula:

        d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

    Example: In a 2D space, it calculates the straight-line distance between two points (x1, y1) and (x2, y2).

    • Manhattan Distance: Also known as "city block distance", it measures the sum of absolute differences along each dimension.
      • Formula:

        d(p, q) = \sum_{i=1}^{n} |p_i - q_i|

    Example: In a grid-like city, it calculates the total number of blocks you’d walk to get from one point to another.

    • Cosine Similarity: Measures the cosine of the angle between two vectors, useful for text data or when the magnitude of vectors doesn’t matter, just the direction.
      • Formula:

        \text{cosine similarity} = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

    Example: Used in document clustering, where text documents are represented as vectors.

    • Minkowski Distance: A generalization of both Euclidean and Manhattan distances, which includes a parameter "p" to adjust the type of distance measure.
      • Formula:

        d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}

    Example: When p=1, it's the Manhattan distance; when p=2, it's the Euclidean distance.

    These distance metrics help determine how "close" points are to each other and guide the assignment of points to clusters in algorithms like K-Means or DBSCAN.
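
    To make these metrics concrete, here is a minimal sketch (the points p and q are hypothetical values chosen only for illustration) that computes each of the four measures with NumPy and SciPy:

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock, minkowski

    # Two illustrative points
    p = np.array([1.0, 2.0, 3.0])
    q = np.array([4.0, 0.0, 3.0])

    # Euclidean distance: straight-line distance, sqrt(sum((p_i - q_i)^2))
    print("Euclidean:", euclidean(p, q))

    # Manhattan (city block) distance: sum of absolute differences
    print("Manhattan:", cityblock(p, q))

    # Cosine similarity: compares direction only, ignoring magnitude
    cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    print("Cosine similarity:", cos_sim)

    # Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean
    print("Minkowski (p=3):", minkowski(p, q, p=3))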

    2. Centroids

    A centroid is the central point in a cluster, representing the "average" of all the points within that cluster. The centroid is a key concept in centroid based clustering methods like K-Means.

    • Role in Clustering: In K-Means, the centroid is used to assign points to clusters and to represent each cluster. The algorithm iterates by adjusting the centroid based on the current members of the cluster, aiming to minimize the distance between points and their centroid.
    • Calculation: For a cluster with n points, the centroid is the arithmetic mean of the coordinates of the points.
      • Formula for a 2D cluster:

        C_x = \frac{1}{n} \sum_{i=1}^{n} x_i; \quad C_y = \frac{1}{n} \sum_{i=1}^{n} y_i

    Where Cx and Cy are the coordinates of the centroid, and xi, yi are the coordinates of the individual points in the cluster.
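
    As a quick illustration, the short sketch below (with a hypothetical 2D cluster) computes the centroid as the coordinate-wise mean described by the formula above:

    import numpy as np

    # Hypothetical 2D cluster of points
    cluster = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [2.0, 1.0]])

    # Centroid = arithmetic mean of each coordinate (Cx, Cy)
    centroid = cluster.mean(axis=0)
    print("Centroid:", centroid)  # [2.0, 2.5]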

    Also Read: What is Logistic Regression in Machine Learning?

    3. Objective Function

    The objective function is what clustering algorithms optimize to form meaningful clusters. In centroid based clustering, the goal is typically to minimize the distance between data points and their corresponding centroids.

    It measures the "quality" of the clustering by quantifying how well the data points fit into their assigned clusters. It helps the algorithm decide when it has found an optimal solution.

    • K-Means Objective Function: In K-Means clustering, the objective function is the Sum of Squared Errors (SSE), which calculates the total distance between each data point and its assigned centroid. 

    The algorithm minimizes this value by adjusting centroids and reassigning points.

    • Formula:

      J = \sum_{i=1}^{k} \sum_{x_j \in c_i} \lVert x_j - \mu_i \rVert^2
    • Where:
      • J is the objective function (SSE),
      • k is the number of clusters,
      • xj is a data point in cluster ci,
      • μi is the centroid of cluster ci,
      • ||xj − μi|| is the distance between data point xj and the centroid μi.

    Minimizing the Sum of Squared Errors (SSE) ensures that each cluster is as tight as possible, meaning the points within a cluster are as close to the centroid as possible. This results in better-defined clusters, making it easier to interpret and analyze the data. 
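
    For reference, here is a small sketch showing how the SSE objective J can be computed by hand and compared against scikit-learn's inertia_ attribute (synthetic data from make_blobs is assumed):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Fit K-Means on a synthetic dataset
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # SSE by hand: squared distance from each point to its assigned centroid
    assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
    sse = np.sum((X - assigned_centroids) ** 2)

    # Matches the library's inertia_, which is the same objective J
    print(sse, kmeans.inertia_)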

    Understanding how this mathematical principle works sets the stage for learning about K-Means, as it directly applies the concept of centroids and SSE to partition data into meaningful groups. 

    Also Read: Maths for Machine Learning Specialisation 

    What is K-Means Clustering? Implementation and Evaluation

     

    K-Means is a popular partitioning clustering algorithm used to divide a dataset into K distinct clusters. The objective is to minimize the Sum of Squared Errors (SSE), ensuring that data points within a cluster are as close as possible to the cluster's centroid. The algorithm operates in an iterative process:

    1. Initialization: Randomly select K centroids from the dataset.
    2. Assignment Step: Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance).
    3. Update Step: Recalculate the centroids by finding the mean of all data points assigned to each cluster.
    4. Repeat the assignment and update steps until convergence, meaning the centroids no longer change.

    The assignment and update steps are repeated until the algorithm converges, meaning the centroids no longer change significantly. Reaching convergence is crucial because it indicates that the clustering model has stabilized. 

    Convergence and Stopping Criteria dictate when the algorithm stops its iterations. This happens when the centroids no longer shift or when a predefined maximum number of iterations is reached. 

    By enforcing these criteria, we ensure that the final clusters are as optimal as possible, based on the defined objective function, leading to a stable and accurate model.
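
    To make the four steps concrete, here is a minimal NumPy sketch of the iterative loop (purely illustrative: it assumes Euclidean distance and random initialization, and does not handle empty clusters, so scikit-learn's KMeans remains the practical choice):

    import numpy as np

    def simple_kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Initialization: pick K random data points as starting centroids
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(max_iters):
            # 2. Assignment: each point goes to its nearest centroid (Euclidean)
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # 3. Update: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Convergence: stop when the centroids no longer move significantly
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return centroids, labels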

    Also Read: Gradient Descent Algorithm: Methodology, Variants & Best Practices 

    Practical Considerations in K-Means

    In K-Means, several practical factors can directly affect the quality and reliability of your clustering results. By addressing these aspects, you can avoid suboptimal clusters and ensure that the algorithm produces meaningful, accurate groupings. 

    1. Choosing the Right K (Number of Clusters) 

    One of the most important aspects of K-Means is selecting the right number of clusters, K. Choosing too few clusters can oversimplify the data, while choosing too many can lead to overfitting. 

    The following methods help in determining the best value for K (a short code sketch for the Elbow Method and Silhouette Score follows this list):

    • Elbow Method: This technique involves plotting the Sum of Squared Errors (SSE) for various values of K and observing where the curve bends or flattens out (the "elbow"). The point at which the SSE starts decreasing at a slower rate is the ideal K.

    Example: In customer segmentation, if the elbow occurs at K=3, it suggests that three clusters best represent the customers' purchasing behaviors.

    • Silhouette Score: Measures how close each point in one cluster is to the points in neighboring clusters. A higher score indicates well-separated and dense clusters.

    Example: If your data is split into clusters of high-value and low-value customers, a higher silhouette score would indicate that these groups are distinct and well-defined.

    • Gap Statistic: This method compares the clustering result with that of a random dataset. A large gap between the real and random clustering suggests a good clustering result.

    Example: If you're clustering images based on their similarity, a large gap would indicate a well-chosen K, as the real clusters differ greatly from random groupings.
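
    As referenced above, here is a short sketch (on a synthetic make_blobs dataset) that computes the SSE curve for the Elbow Method and the Silhouette Score over a range of K values:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    sse, sil = [], []
    k_values = range(2, 9)
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)                      # SSE for the elbow plot
        sil.append(silhouette_score(X, km.labels_))  # higher = better-separated clusters

    plt.plot(list(k_values), sse, marker='o')
    plt.xlabel("Number of clusters K")
    plt.ylabel("SSE (inertia)")
    plt.title("Elbow Method")
    plt.show()

    print("Silhouette scores by K:", dict(zip(k_values, [round(s, 3) for s in sil])))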

    Also Read: What is Overfitting & Underfitting In Machine Learning ? [Everything You Need to Learn] 

    2. Initial Centroid Selection 

    The initialization of centroids can greatly affect the clustering outcome. Poor initial centroids can lead to local minima, where the algorithm converges prematurely without finding the optimal clustering.

    • Random Initialization Problem: Randomly selecting initial centroids can sometimes place them too close to each other, causing poor clustering results and slower convergence.

    Example: If you’re clustering data for product recommendations and start with centroids close to each other, the algorithm might incorrectly group diverse products together, leading to inaccurate recommendations.

    • K-Means++: A more sophisticated method for selecting initial centroids. It spreads out the initial centroids, reducing the likelihood of poor initialization and improving the final results.

    Example: In a dataset of geographical locations, K-Means++ would ensure that the initial centroids are spread across the map, leading to more accurate clustering of regions with distinct characteristics.

    3. Outliers 

    Outliers can significantly distort the clustering results, as they pull centroids away from the true center of the data.

    • Impact of Outliers: Outliers affect the mean of the cluster, shifting the centroid and leading to poorly defined clusters.

    Example: If you’re clustering employees based on salary and experience, outliers like a few extremely high earners could skew the results and place them in the wrong cluster, causing inaccurate groupings of similar employees.

    • Handling Outliers: You can either remove outliers from the dataset or use clustering methods that are more resistant to them, like K-Medoids, which uses actual data points as centroids instead of the mean.

    Example: For customer segmentation, removing extreme outliers (such as a customer who makes an unusually large purchase once) would help in forming clusters that better represent typical customer behaviors.
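
    As a small numerical illustration of this point, the sketch below (hypothetical salary values) shows how a single extreme value drags the mean away from the bulk of the data, while a medoid, being an actual data point, stays representative:

    import numpy as np

    # Hypothetical salaries (in thousands); one extreme earner acts as an outlier
    salaries = np.array([40, 42, 45, 48, 50, 400])

    # The mean (used by K-Means centroids) is pulled toward the outlier
    print("Mean:", salaries.mean())  # ~104.2

    # The medoid (the point minimizing total distance to the others) stays near the bulk
    dist_sums = np.abs(salaries[:, None] - salaries[None, :]).sum(axis=1)
    print("Medoid:", salaries[dist_sums.argmin()])  # 45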

    Understanding and addressing these practical considerations in K-Means ensures that you can avoid common pitfalls and make the most of the algorithm's potential. 

    Also Read: Outlier Analysis in Data Mining: Techniques, Detection Methods, and Best Practices

    Next, let’s put these concepts into action and see how K-Means can help you efficiently cluster your data and gain meaningful insights.

    Practical Example: Implementing K-Means

    Implementing K-Means is especially beneficial because it is computationally efficient and works well with datasets where clusters are roughly spherical and well-separated. With this hands-on example, you'll learn how to apply K-Means to segment your data, find patterns, and generate actionable insights. 

    Step 1: Install Required Libraries

    First, ensure you have the required libraries installed. You can install them using pip if you don't have them yet: 

    pip install scikit-learn matplotlib numpy

    Step 2: Import Libraries

    Next, import the necessary libraries for data manipulation, clustering, and visualization. 

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    • numpy is for numerical operations.
    • matplotlib.pyplot is for plotting graphs.
    • KMeans is the clustering algorithm from scikit-learn.
    • make_blobs is used to generate synthetic data for clustering.

    Struggling with data manipulation and visualization? Check out upGrad’s free Learn Python Libraries: NumPy, Matplotlib & Pandas course. Gain the skills to handle complex datasets and create powerful visualizations. Start learning today!

    Step 3: Generate Synthetic Data

    For this example, we’ll create a simple synthetic dataset using make_blobs. This function generates clusters of points for us to cluster. 

    # Create a synthetic dataset with 2 features and 3 clusters
    X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    This generates 300 data points divided into 3 clusters.
    Step 4: Apply K-Means Clustering
    Now, let’s apply the K-Means algorithm to this data. We’ll set K=3 since we know the data has 3 clusters. 

    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(X)
    # Get the cluster centroids
    centroids = kmeans.cluster_centers_
    # Get the labels (cluster assignments for each data point)
    labels = kmeans.labels_

    Here:

    • n_clusters=3 specifies that we want to divide the data into 3 clusters.
    • fit(X) runs the K-Means algorithm on the data.
    • cluster_centers_ gives the coordinates of the centroids of the clusters.
    • labels_ contains the cluster assignment for each data point.

    Step 5: Visualize the Clusters

    Let’s plot the data points and their respective cluster centroids to visualize the result of the clustering. 

    # Plot the data points and the centroids
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.6)  # Data points
    plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X')  # Centroids
    plt.title("K-Means Clustering")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()
    • The data points are colored based on their cluster assignment using the c=labels argument.
    • The centroids are marked with red 'X' symbols.

    Output: A scatter plot of the 300 data points colored by cluster assignment, with the three centroids marked as red 'X' symbols.

    Step 6: Evaluate the Results

    We can use the inertia_ attribute of the fitted KMeans object to evaluate how well the clusters were formed. Inertia measures the total squared distance between the data points and their respective centroids. A lower inertia value indicates tighter clustering. 

    # Print the inertia (sum of squared distances to the closest centroid)
    print(f"Inertia: {kmeans.inertia_}")

    After implementing K-Means, the next step is to experiment with different values of K to find the optimal number of clusters using methods like the Elbow Method or Silhouette Score. 

    Once you've fine-tuned your clustering model, the next crucial step is Evaluating Clustering Performance to ensure the quality of your clusters.

    Also Read: Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025

    Evaluating Clustering Performance

    Evaluating the performance of your clustering model is essential to ensure the quality and validity of the results. Without proper evaluation, you risk drawing incorrect conclusions from poorly defined clusters. For example, if you're segmenting customers for targeted marketing, poorly defined clusters could lead to ineffective campaigns that miss the mark. 

    1. Internal Evaluation Metrics

    Internal evaluation metrics assess the quality of your clustering by looking at the structure and coherence of the clusters themselves. These metrics do not require any external labels, making them ideal for unsupervised learning.

    • Silhouette Score 

    The Silhouette Score measures how similar each point is to its own cluster compared to other clusters. A score close to +1 means the point is well-clustered, while a score close to -1 suggests the point might be incorrectly assigned. Used by Netflix to evaluate how well the user segments created for personalized recommendations are defined and distinct.

    Why it matters: It gives you a clear indication of how well the data points fit within their clusters. 

    How to calculate:

    from sklearn.metrics import silhouette_score
    score = silhouette_score(X, labels)
    print(f"Silhouette Score: {score}")
    • Davies-Bouldin Index 

    The Davies-Bouldin Index evaluates cluster quality by comparing the average distance between the clusters to their internal cohesion. A lower score indicates better clustering. Applied by Amazon to measure the quality of customer clusters in order to tailor targeted marketing campaigns.

    Why it matters: It balances both the compactness of clusters and their separation. 

    How to calculate:

    from sklearn.metrics import davies_bouldin_score
    db_score = davies_bouldin_score(X, labels)
    print(f"Davies-Bouldin Index: {db_score}")
    • Dunn Index 

    The Dunn Index identifies clusters that are well-separated and internally compact. A higher value indicates better clustering. Used by Spotify to assess the separation and cohesion of music genre clusters, ensuring better music recommendations based on user preferences. 

    Why it matters: It focuses on the distance between clusters relative to the size of the clusters. 

    How to calculate: This is less straightforward and typically requires custom implementation, but it’s useful for comparing different clustering configurations.

    D = min(Inter-cluster distance) / max(Intra-cluster distance)

    • Inter-cluster distance: The distance between the closest points from different clusters.
    • Intra-cluster distance: The maximum distance between points within the same cluster.

    A higher Dunn Index indicates better clustering, with well-separated and compact clusters.

    To calculate it, you need to compute pairwise distances between all clusters and their members. It’s not directly available in scikit-learn, but custom code can be written to calculate it. 

    Here's a rough idea of how you might implement it: 

    from sklearn.metrics import pairwise_distances
    import numpy as np

    def dunn_index(X, labels):
        # Pairwise distances between all points
        pairwise_dist = pairwise_distances(X)
        n_clusters = len(set(labels))

        min_intercluster_distance = np.inf
        max_intracluster_distance = -np.inf

        for i in range(n_clusters):
            mask_i = labels == i

            # Intra-cluster distance: largest distance between two points in cluster i
            max_intracluster_distance = max(
                max_intracluster_distance,
                np.max(pairwise_dist[mask_i][:, mask_i])
            )

            # Inter-cluster distance: smallest distance between cluster i and each later cluster j
            for j in range(i + 1, n_clusters):
                mask_j = labels == j
                inter_distance = np.min(pairwise_dist[mask_i][:, mask_j])
                min_intercluster_distance = min(min_intercluster_distance, inter_distance)

        # Dunn Index = min inter-cluster distance / max intra-cluster distance
        return min_intercluster_distance / max_intracluster_distance
    2. External Evaluation Metrics

    External evaluation metrics compare your clustering results against a known ground truth. These metrics are particularly useful when you have labeled data available, as they provide an objective measure of how well the clustering algorithm matched the expected results. 

    They help you validate your clustering performance and ensure the model's output is meaningful.

    Also Read: What are Sklearn Metrics and Why You Need to Know About Them? 

    • Adjusted Rand Index (ARI) 

    The ARI measures how similar your clustering is to a ground truth classification. It accounts for chance, making it a more reliable comparison. A score close to 1 indicates perfect agreement with the true labels. Used by Google News to compare clustering results of news articles with actual topics, helping improve content categorization.

    Why it matters: It lets you compare clustering results to known ground truth. 

    How to calculate:

    from sklearn.metrics import adjusted_rand_score
    ari = adjusted_rand_score(true_labels, labels)
    print(f"Adjusted Rand Index: {ari}")
    • Normalized Mutual Information (NMI) 

    NMI quantifies the amount of shared information between your clustering and the ground truth. A value of 1 means the clustering is identical to the true labels, while 0 means there is no information shared. Applied by Twitter to assess the similarity between user clusters based on their activity and engagement, helping improve ad targeting.

    Why it matters: It’s a good measure when you want to quantify the similarity between the predicted clusters and true labels. 

    How to calculate:

    from sklearn.metrics import normalized_mutual_info_score
    nmi = normalized_mutual_info_score(true_labels, labels)
    print(f"Normalized Mutual Information: {nmi}")
    3. Visualizing Clusters

    Visualizing clusters is an effective way to understand the results of clustering algorithms. By reducing the dimensions of your data (using methods like t-SNE or PCA), you can get a clearer, more intuitive sense of how well your data has been grouped. 

    Visualization helps to identify patterns, outliers, and potential improvements for the clustering model.

    • t-SNE (t-distributed Stochastic Neighbor Embedding) 

    t-SNE is a dimensionality reduction technique that helps visualize high-dimensional data by reducing it to two or three dimensions. It is particularly useful for visualizing clusters. Used by Amazon to visualize customer behavior clusters, helping improve product recommendations and marketing strategies.

    Why it matters: It helps you understand the spatial distribution of your clusters, especially in complex datasets. 

    How to visualize:

    from sklearn.manifold import TSNE
    tsne = TSNE(n_components=2)
    X_tsne = tsne.fit_transform(X)
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
    plt.title("t-SNE Visualization")
    plt.show()
    • PCA (Principal Component Analysis) 

    PCA is another technique for reducing the dimensionality of your data while preserving variance. It is often used to plot the data in 2D or 3D for easier visualization of clusters. Applied by Facebook to reduce the dimensions of user interaction data for efficient clustering and targeted content delivery.

    Why it matters: PCA helps identify the most important dimensions of your data and shows how clusters are distributed in these dimensions. 

    How to visualize:

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
    plt.title("PCA Visualization")
    plt.show()

    After evaluating the clustering performance, you can experiment with different clustering algorithms to see how they compare to K-Means. Try applying K-Medoids or Mini-Batch K-Means for larger datasets. 

    Also Read: Introduction to Classification Algorithm: Concepts & Various Types 

    Explore how variations like Gaussian Mixture Models (GMM) work in handling more complex cluster shapes. Let's dive deeper into these advanced variations and their applications.

    Advanced Variations and Extensions of K-Means Clustering

    Advanced variations and extensions of K-Means clustering address its limitations, making it more versatile and applicable to a wider range of data. These methods improve K-Means by enhancing its efficiency, scalability, and ability to handle more complex datasets. 

    For example, K-Medoids deals with outliers better, while Mini-Batch K-Means speeds up the algorithm for large datasets. 

    K-Medoids

    K-Medoids is a variation of K-Means clustering that uses actual data points as the centroids (medoids) of clusters, rather than calculating the mean of the data points. This method is more robust to outliers and is ideal for datasets with noisy or non-numeric data. 

    Unlike K-Means, K-Medoids minimizes the sum of dissimilarities between points and medoids, rather than the squared Euclidean distance.

    Here's a quick comparison between K-Means and K-Medoids to highlight the key differences:


    | Aspect | K-Means | K-Medoids |
    | --- | --- | --- |
    | Centroid Calculation | Uses the mean of data points | Uses actual data points (medoids) |
    | Sensitivity to Outliers | Sensitive to outliers | More robust to outliers |
    | Data Types | Primarily works with numerical data | Can work with any data type (e.g., categorical, numeric) |
    | Computational Efficiency | Generally faster for large datasets | Slower, especially with large datasets |
    | Cluster Shape | Assumes spherical clusters | Can handle non-spherical clusters |

    K-Medoids is particularly beneficial when dealing with datasets that include outliers or categorical data, where the mean might not represent the "center" of the data well.

    K-Medoids operates similarly to K-Means, but instead of using the mean of the data points in a cluster, it selects an actual data point as the cluster’s centroid (medoid). Here's a breakdown of how it works:

    1. Initialize K medoids: Randomly select K data points from the dataset as the initial medoids.
    2. Assign points to the nearest medoid: For each data point, calculate the distance to each medoid, and assign the point to the closest medoid.
    3. Update the medoids: For each cluster, calculate the data point that minimizes the total distance to all other points in the cluster, and set it as the new medoid.
    4. Repeat the assignment and update steps until convergence (no change in medoids).

    Here’s a simple Python implementation using the PAM (Partitioning Around Medoids) method for K-Medoids:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

    # Create synthetic dataset
    X, y = make_blobs(n_samples=300, centers=3, random_state=42)

    # K-Medoids Implementation (simple PAM-style iteration)
    def k_medoids(X, n_clusters):
        # Initialize random medoids (actual data points)
        medoids_idx = np.random.choice(len(X), n_clusters, replace=False)
        medoids = X[medoids_idx]
        while True:
            # Assign each point to the nearest medoid
            labels = pairwise_distances_argmin_min(X, medoids)[0]

            # Update medoids: the point with the smallest total distance
            # to all other points in its cluster becomes the new medoid
            new_medoids = np.copy(medoids)
            for i in range(n_clusters):
                cluster_points = X[labels == i]
                if len(cluster_points) == 0:
                    continue  # keep the old medoid if a cluster ends up empty
                total_distances = pairwise_distances(cluster_points).sum(axis=1)
                new_medoids[i] = cluster_points[np.argmin(total_distances)]

            # If no change in medoids, stop
            if np.allclose(medoids, new_medoids):
                break
            medoids = new_medoids

        return medoids, labels

    # Apply K-Medoids
    medoids, labels = k_medoids(X, 3)

    # Visualize the result
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.scatter(medoids[:, 0], medoids[:, 1], c='red', marker='X', s=200, label='Medoids')
    plt.title("K-Medoids Clustering")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()

    Output: A scatter plot of the three clusters in different colors, with the medoids marked as red 'X' symbols. 

    In this example:

    • We generate synthetic data using make_blobs.
    • The k_medoids function implements the K-Medoids algorithm, where the points are assigned to the nearest medoid, and medoids are updated iteratively.
    • The result is visualized with clusters represented by different colors, and the medoids marked with red 'X's.

    Also Read: Key Data Mining Functionalities with Examples for Better Analysis 

    While K-Medoids is an excellent alternative to K-Means, there are several other variations of centroid based clustering that address different challenges, particularly when working with large datasets or more complex data structures. 

    These variations provide enhanced flexibility and performance, depending on the nature of your data. Let’s explore a few of these variations:

    • Mini-Batch K-Means 

    This variant speeds up the standard K-Means algorithm by updating the centroids using small random batches of data instead of the entire dataset.

    Algorithm:

    1. Randomly initialize K centroids.
    2. Repeat until convergence:
      • Select a small random batch of data points.
      • Assign each point to the nearest centroid.
      • Update the centroids based on the selected batch.
    3. Stop when centroids stabilize or after a set number of iterations.

    How it works: Instead of using the whole dataset in each iteration, it uses a small subset (mini-batch) to update the centroids. This significantly reduces computation time for large datasets.
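
    A minimal sketch using scikit-learn's MiniBatchKMeans on a synthetic dataset (the sample size and batch_size values here are illustrative):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    # A larger synthetic dataset where mini-batch updates pay off
    X, _ = make_blobs(n_samples=10000, centers=3, random_state=42)

    # batch_size controls how many points update the centroids per iteration
    mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=42)
    labels = mbk.fit_predict(X)

    print("Centroids:\n", mbk.cluster_centers_)
    print("Inertia:", mbk.inertia_)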

    • Gaussian Mixture Models (GMM) 

    GMM is a probabilistic clustering technique that assumes data points are generated from a mixture of several Gaussian distributions.

    Algorithm: 

    1. Initialize the parameters of each Gaussian (mean, covariance, and mixture weight).
    2. Repeat the following until convergence:
      • E-Step: Compute the probability of each data point belonging to each Gaussian component.
      • M-Step: Update the parameters of the Gaussians (mean, covariance, and weight) based on the probabilities from the E-step.
    3. Stop when the parameters converge or after a set number of iterations.

    How it works: Unlike K-Means, which assigns points to a single cluster, GMM assigns probabilities to each data point for belonging to each cluster, allowing for "soft" clustering. It applies the Expectation-Maximization (EM) algorithm to iteratively estimate the parameters of the Gaussian distributions, refining the probability of data points belonging to each cluster to optimize the clustering results.
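
    A short sketch using scikit-learn's GaussianMixture to obtain both hard assignments and the "soft" membership probabilities described above (synthetic data assumed):

    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=0)

    # covariance_type='full' lets each Gaussian take its own elliptical shape
    gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
    gmm.fit(X)

    hard_labels = gmm.predict(X)       # most likely component for each point
    soft_probs = gmm.predict_proba(X)  # probability of each point belonging to each component

    print(soft_probs[:3].round(3))     # each row sums to 1 across the 3 components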

    • K-Means++ Initialization 

    K-Means++ is an improved initialization method that helps reduce the chances of poor clustering by spreading out the initial centroids more effectively.

    Algorithm:

    1. Randomly select the first centroid.
    2. For each data point, compute its distance to the nearest existing centroid.
    3. Choose the next centroid with probability proportional to the square of the distance to the nearest centroid.
    4. Repeat steps 2 and 3 until K centroids are selected. 
    5. Run the standard K-Means algorithm with the initialized centroids.

    How it works: It selects the first centroid randomly, then chooses subsequent centroids based on a probability distribution proportional to their distance from the already selected centroids, ensuring better starting points for the K-Means algorithm.
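
    In scikit-learn, this initialization is available through the init parameter of KMeans (and is the default); the sketch below contrasts it with plain random initialization on synthetic data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # init='k-means++' spreads out the initial centroids (scikit-learn's default);
    # init='random' with a single run shows the effect of plain random initialization
    km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
    km_rand = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(X)

    print("K-Means++ inertia:", km_pp.inertia_)
    print("Random init inertia:", km_rand.inertia_)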

    After experimenting with K-Means and its variations, try applying these methods to real-world datasets. Explore clustering with high-dimensional data and non-spherical shapes to see how each variation performs. Experiment with different initialization methods and clustering metrics to fine-tune your results. 

    Also Read: Top 10 Dimensionality Reduction Techniques for Machine Learning(ML) in 2025 

    Once you've gained hands-on experience, the next step is understanding where K-Means excels and its limitations.

    Advantages and Limitations of Centroid Based Clustering in Data Mining

    Understanding the advantages and limitations of centroid based clustering is critical for optimizing its use in data mining tasks. While this method is powerful and widely applicable, it's not a one-size-fits-all solution. By recognizing where it excels and where it falls short, you can make more informed decisions on when to apply this technique.

    The table below summarizes the key advantages and limitations of Centroid Based Clustering for quick reference.

    | Advantages | Limitations | Workaround |
    | --- | --- | --- |
    | Does not require labeled data, making it ideal for discovering patterns in unlabeled datasets. | Assumes clusters are spherical, which may not always be true. | Use DBSCAN or Spectral Clustering for non-spherical clusters. |
    | Efficient in grouping data with complex relationships. | Struggles with high-dimensional data, as distance becomes less meaningful. | Apply PCA to reduce dimensionality before clustering. |
    | Helps uncover hidden patterns and relationships in data. | Sensitive to outliers, which can distort clusters. | Use K-Medoids or robust clustering methods for better outlier handling. |
    | Can identify anomalies and outliers by nature. | Struggles to capture hierarchical clusters. | Use Hierarchical Clustering to handle nested clusters. |
    | Scalable for large datasets, making it suitable for big data applications. | Computationally intensive for very large datasets. | Use Mini-Batch K-Means to speed up clustering for large datasets. |

    Also Read: Machine Learning Projects with Source Code in 2025

    To take your clustering skills further, experiment with different initialization techniques like K-Means++ and try out clustering methods like DBSCAN for non-spherical data. You can also visualize high-dimensional data using PCA or t-SNE and apply Mini-Batch K-Means for faster clustering on large datasets. 

    Understanding how these methods are applied to fields like customer segmentation, anomaly detection, and recommendation systems will help you see the practical value of clustering in solving real-life problems.

    Real Life Applications of Clustering in Machine Learning

    Clustering techniques, especially centroid-based methods like K-Means, play a pivotal role in solving a wide range of real-world problems. For example, businesses use clustering to group customers based on purchasing behavior, which helps create targeted marketing campaigns. 

    This insight will make it easier to apply clustering in your own projects. Below is a table summarizing how it can be used in various real-life applications:

    | Application | Description |
    | --- | --- |
    | Biological Data Analysis | Clustering is used extensively in genomics, particularly for classifying gene expression data. For example, NASA uses clustering in bioinformatics to analyze gene patterns for disease research. It's crucial in identifying gene expression groups that are linked to various diseases. |
    | Geospatial Data Clustering | Uber and other ride-sharing companies use clustering to analyze geospatial data, grouping areas with high ride demand. This helps optimize pricing models and dispatch systems by identifying "hot spots" for rides in real time. |
    | Market Basket Analysis | Retailers like Amazon use clustering for market basket analysis, grouping products that are often bought together. This informs product placement strategies and personalized recommendations on e-commerce platforms. |
    | Image Compression | Color quantization for image compression relies on clustering techniques to reduce image file sizes. It groups similar pixel colors together, helping maintain image quality while minimizing storage. It's used in applications ranging from digital photography to online streaming services. |
    | Document Clustering | Google News uses clustering to group similar news articles, improving content recommendation. It analyzes text data from news sources and clusters similar topics, ensuring users receive relevant, grouped content. |

    After exploring clustering, you can dive into more advanced topics like Density-Based Clustering (e.g., DBSCAN) for handling noisy data, or Deep Learning for Clustering, such as Autoencoders for unsupervised feature learning. 

    You can also explore Dimensionality Reduction techniques like t-SNE and UMAP, which complement clustering by making high-dimensional data more manageable. These topics will help you build more sophisticated models for complex datasets.

    Now that you’ve gained insights into Centroid Based clustering, take your skills further with the Executive Programme in Generative AI for Leaders by upGrad. This program offers advanced training on clustering techniques and machine learning strategies, preparing you to drive innovation and apply it in complex data mining scenarios.

    Test Your Knowledge on Centroid Based Clustering!

    Assess your understanding of centroid based clustering, its key components, advantages, limitations, and real-world applications by answering the following multiple-choice questions.

    Test your knowledge now!

    Q1. What is the primary objective of centroid based clustering?
    A) To create hierarchical tree structures
    B) To partition data into groups based on similarity
    C) To maximize the variance of data within clusters
    D) To eliminate outliers from the dataset

    Q2. Which of the following is an example of a centroid based clustering algorithm?
    A) DBSCAN
    B) K-Means
    C) Agglomerative Clustering
    D) Hierarchical Clustering

    Q3. In K-Means clustering, what represents the center of a cluster?
    A) A random data point
    B) A centroid (mean) of the cluster
    C) The farthest data point
    D) The median of the cluster

    Q4. How does K-Means determine the final cluster centroids?
    A) By choosing the data point closest to the cluster’s edge
    B) By calculating the average of all data points within a cluster
    C) By selecting the centroid randomly
    D) By analyzing the data's variance

    Q5. What is a key limitation of K-Means clustering?
    A) It requires labeled data
    B) It assumes clusters are spherical
    C) It does not scale with large datasets
    D) It struggles with categorical data

    Q6. Which technique can be used to improve the initialization of centroids in K-Means?
    A) Mini-Batch K-Means
    B) K-Means++
    C) DBSCAN
    D) Gaussian Mixture Models

    Q7. How does Mini-Batch K-Means improve the standard K-Means algorithm?
    A) By processing smaller subsets of the data at a time
    B) By using only categorical data for clustering
    C) By performing hierarchical clustering on data
    D) By removing outliers before clustering

    Q8. When would you consider using K-Medoids over K-Means?
    A) When you have large, high-dimensional data
    B) When you need to avoid using actual data points as centroids
    C) When the data contains significant outliers
    D) When your data is perfectly spherical

    Q9. What is the primary advantage of Gaussian Mixture Models (GMM) over K-Means?
    A) GMM can handle overlapping clusters with probabilistic assignments
    B) GMM requires fewer data points for accurate clustering
    C) GMM automatically determines the optimal number of clusters
    D) GMM is faster in convergence compared to K-Means

    Q10. In which scenario would hierarchical clustering be more suitable than centroid based clustering?
    A) When the dataset has large, well-separated clusters
    B) When the dataset contains a high amount of noise
    C) When you need to visualize nested data structures
    D) When computational efficiency is the top priority

    You can also continue expanding your skills in unsupervised learning with upGrad, which will help you deepen your understanding of centroid based clustering in data mining and its real-life applications.

    Become an Expert at Clustering with upGrad!

    To gain proficiency in applying centroid based clustering techniques like K-Means and its variations, start by understanding the basics of unsupervised learning, clustering algorithms, and data preprocessing. Many learners face challenges when it comes to implementing these techniques in real-life scenarios.

    Trusted by data professionals, upGrad offers courses that teach you how to apply clustering to real-life data, helping you build efficient clustering systems for tasks like segmentation and anomaly detection.

    In addition to the courses mentioned, here are some more resources to help you further elevate your skills: 

    Not sure where to go next in your ML journey? upGrad's personalized career guidance can help you explore the right learning path based on your goals. You can also visit your nearest upGrad center and start hands-on training today!



    Frequently Asked Questions (FAQs)

    1. How does centroid-based clustering work in customer churn prediction for telecom companies?

    2. How can centroid-based clustering be applied in disease outbreak prediction?

    3. Is centroid based clustering suitable for text data?

    4. How do I handle categorical data with centroid based clustering?

    5. How do centroid based clustering algorithms compare with hierarchical clustering?

    6. Can centroid based clustering handle imbalanced clusters?

    7. What are the implications of poor centroid initialization in clustering?

    8. Can centroid based clustering be applied to image segmentation?

    9. How can you apply centroid based clustering to marketing campaigns?

    10. How does centroid based clustering handle changes in data over time?

    11. How can centroid based clustering be applied in social network analysis?
