
    What is Centroid Based Clustering? Implementation, Variations & Applications

    By Mukesh Kumar

    Updated on May 09, 2025 | 28 min read | 1.5k views


    Did you know? The term "K-Means" was first coined by James MacQueen in 1967, but the underlying idea dates back to Hugo Steinhaus in 1956. The standard algorithm was proposed independently by Stuart Lloyd in 1957 and Edward Forgy in 1965, which is why it is sometimes called the Lloyd-Forgy algorithm. 

    Centroid-based clustering is a method where data points are grouped based on their similarity to a central point, called the centroid. The problem? Choosing the right technique and understanding its variations can be tricky. 

    In this tutorial, you’ll learn how centroid based clustering works, explore its different forms, and discover how it applies to real-life problems. 

    Improve your machine learning skills with upGrad’s online AI and ML courses. Specialize in cybersecurity, full-stack development, game development, and much more. Take the next step in your learning journey! 

    What is Centroid Based Clustering? Key Concepts and Types

    Clustering is a fundamental technique in unsupervised machine learning where data points are grouped based on their similarities. The objective is to identify inherent patterns or structures within the data without predefined labels.

    Working with centroid-based clustering goes beyond simply applying the algorithm. To make the most of it, it's essential to focus on data preprocessing, choosing an appropriate number of clusters, and accurately interpreting the clustering results.

    In centroid-based clustering, each cluster is represented by a central point known as the centroid, which acts as the "average" of all data points in that cluster. This approach works well for partitioning data into distinct groups where each cluster can be described by its central point, simplifying the analysis and interpretation of complex datasets.

    There are two main types of clustering:

    • Hierarchical Clustering: Builds a tree-like structure, called a dendrogram, where data points are grouped progressively. It can be:
      • Agglomerative (bottom-up approach): Starts with individual points and merges clusters.
      • Divisive (top-down approach): Starts with all points in one cluster and divides them into smaller clusters.
    • Partitioning/Centroid-Based Clustering: Divides the data into a predefined number of clusters. The most common method is K-Means, where clusters are formed around centroids. Other variants include:
      • K-Medoids: Uses actual data points as cluster centers.
      • Mini-Batch K-Means: Uses small random samples for faster clustering on large datasets.

    The mathematical foundation behind centroid based clustering in data mining is key to its simplicity and effectiveness. By minimizing the distance between data points and their respective centroids, it ensures that the clusters are as compact and well-separated as possible. 

    Also Read: What is Cluster Analysis in Data Mining? Methods, Benefits, and More

    Mathematical Foundation of Clustering


    At its core, the mathematical foundation of clustering focuses on how data points are grouped based on their similarities, often by minimizing a specific distance measure. Understanding this foundation helps you grasp how algorithms like K-Means and K-Medoids work to define clusters with precision.

    1. Distance Metrics

    Distance metrics are the backbone of clustering algorithms, as they define how "similar" or "distant" two data points are from each other. Different clustering algorithms rely on various distance measures to group data points.

    • Euclidean Distance: The most commonly used metric, especially in algorithms like K-Means. It measures the straight-line distance between two points in a Euclidean space.
      • Formula:

        d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

    Example: In a 2D space, it calculates the straight-line distance between two points (x1, y1) and (x2, y2).

    • Manhattan Distance: Also known as "city block distance", it measures the sum of absolute differences along each dimension.
      • Formula:

        d(p, q) = \sum_{i=1}^{n} |p_i - q_i|

    Example: In a grid-like city, it calculates the total number of blocks you’d walk to get from one point to another.

    • Cosine Similarity: Measures the cosine of the angle between two vectors, useful for text data or when the magnitude of vectors doesn’t matter, just the direction.
      • Formula:

        \text{cosine similarity} = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

    Example: Used in document clustering, where text documents are represented as vectors.

    • Minkowski Distance: A generalization of both Euclidean and Manhattan distances, which includes a parameter "p" to adjust the type of distance measure.
      • Formula:

        d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}

    Example: When p=1, it's the Manhattan distance; when p=2, it's the Euclidean distance.

    These distance metrics help determine how "close" points are to each other and guide the assignment of points to clusters in algorithms like K-Means or DBSCAN.
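
    To make these metrics concrete, here is a minimal sketch (the points p and q are hypothetical values chosen only for illustration) that computes each of the four measures with NumPy and SciPy:

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock, minkowski

    # Two illustrative points
    p = np.array([1.0, 2.0, 3.0])
    q = np.array([4.0, 0.0, 3.0])

    # Euclidean distance: straight-line distance, sqrt(sum((p_i - q_i)^2))
    print("Euclidean:", euclidean(p, q))

    # Manhattan (city block) distance: sum of absolute differences
    print("Manhattan:", cityblock(p, q))

    # Cosine similarity: compares direction only, ignoring magnitude
    cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    print("Cosine similarity:", cos_sim)

    # Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean
    print("Minkowski (p=3):", minkowski(p, q, p=3))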

    2. Centroids

    A centroid is the central point in a cluster, representing the "average" of all the points within that cluster. The centroid is a key concept in centroid based clustering methods like K-Means.

    • Role in Clustering: In K-Means, the centroid is used to assign points to clusters and to represent each cluster. The algorithm iterates by adjusting the centroid based on the current members of the cluster, aiming to minimize the distance between points and their centroid.
    • Calculation: For a cluster with n points, the centroid is the arithmetic mean of the coordinates of the points.
      • Formula for a 2D cluster:

        C_x = \frac{1}{n} \sum_{i=1}^{n} x_i; \quad C_y = \frac{1}{n} \sum_{i=1}^{n} y_i

    Where Cx and Cy are the coordinates of the centroid, and xi, yi are the coordinates of the individual points in the cluster.
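
    As a quick illustration, the short sketch below (with a hypothetical 2D cluster) computes the centroid as the coordinate-wise mean described by the formula above:

    import numpy as np

    # Hypothetical 2D cluster of points
    cluster = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [2.0, 1.0]])

    # Centroid = arithmetic mean of each coordinate (Cx, Cy)
    centroid = cluster.mean(axis=0)
    print("Centroid:", centroid)  # [2.0, 2.5]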

    Also Read: What is Logistic Regression in Machine Learning?

    3. Objective Function

    The objective function is what clustering algorithms optimize to form meaningful clusters. In centroid based clustering, the goal is typically to minimize the distance between data points and their corresponding centroids.

    It measures the "quality" of the clustering by quantifying how well the data points fit into their assigned clusters. It helps the algorithm decide when it has found an optimal solution.

    • K-Means Objective Function: In K-Means clustering, the objective function is the Sum of Squared Errors (SSE), which calculates the total distance between each data point and its assigned centroid. 

    The algorithm minimizes this value by adjusting centroids and reassigning points.

    • Formula:

      J = \sum_{i=1}^{k} \sum_{x_j \in c_i} \lVert x_j - \mu_i \rVert^2
    • Where:
      • J is the objective function (SSE),
      • k is the number of clusters,
      • xj is a data point in cluster ci,
      • μi is the centroid of cluster ci,
      • ||xj − μi|| is the distance between data point xj and the centroid μi.

    Minimizing the Sum of Squared Errors (SSE) ensures that each cluster is as tight as possible, meaning the points within a cluster are as close to the centroid as possible. This results in better-defined clusters, making it easier to interpret and analyze the data. 
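
    For reference, here is a small sketch showing how the SSE objective J can be computed by hand and compared against scikit-learn's inertia_ attribute (synthetic data from make_blobs is assumed):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Fit K-Means on a synthetic dataset
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # SSE by hand: squared distance from each point to its assigned centroid
    assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
    sse = np.sum((X - assigned_centroids) ** 2)

    # Matches the library's inertia_, which is the same objective J
    print(sse, kmeans.inertia_)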

    Understanding how this mathematical principle works sets the stage for learning about K-Means, as it directly applies the concept of centroids and SSE to partition data into meaningful groups. 

    Also Read: Maths for Machine Learning Specialisation 

    What is K-Means Clustering? Implementation and Evaluation

     

    K-Means is a popular partitioning clustering algorithm used to divide a dataset into K distinct clusters. The objective is to minimize the Sum of Squared Errors (SSE), ensuring that data points within a cluster are as close as possible to the cluster's centroid. The algorithm operates in an iterative process:

    1. Initialization: Randomly select K centroids from the dataset.
    2. Assignment Step: Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance).
    3. Update Step: Recalculate the centroids by finding the mean of all data points assigned to each cluster.
    4. Repeat the assignment and update steps until convergence, meaning the centroids no longer change.

    The assignment and update steps are repeated until the algorithm converges, meaning the centroids no longer change significantly. Reaching convergence is crucial because it indicates that the clustering model has stabilized. 

    Convergence and Stopping Criteria dictate when the algorithm stops its iterations. This happens when the centroids no longer shift or when a predefined maximum number of iterations is reached. 

    By enforcing these criteria, we ensure that the final clusters are as optimal as possible, based on the defined objective function, leading to a stable and accurate model.
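
    To make the four steps concrete, here is a minimal NumPy sketch of the iterative loop (purely illustrative: it assumes Euclidean distance and random initialization, and does not handle empty clusters, so scikit-learn's KMeans remains the practical choice):

    import numpy as np

    def simple_kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Initialization: pick K random data points as starting centroids
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(max_iters):
            # 2. Assignment: each point goes to its nearest centroid (Euclidean)
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # 3. Update: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Convergence: stop when the centroids no longer move significantly
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return centroids, labels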

    Also Read: Gradient Descent Algorithm: Methodology, Variants & Best Practices 

    Practical Considerations in K-Means

    In K-Means, several practical factors can directly affect the quality and reliability of your clustering results. By addressing these aspects, you can avoid suboptimal clusters and ensure that the algorithm produces meaningful, accurate groupings. 

    1. Choosing the Right K (Number of Clusters) 

    One of the most important aspects of K-Means is selecting the right number of clusters, K. Choosing too few clusters can oversimplify the data, while choosing too many can lead to overfitting. 

    The following methods help in determining the best value for K (a short code sketch for the Elbow Method and Silhouette Score follows this list):

    • Elbow Method: This technique involves plotting the Sum of Squared Errors (SSE) for various values of K and observing where the curve bends or flattens out (the "elbow"). The point at which the SSE starts decreasing at a slower rate is the ideal K.

    Example: In customer segmentation, if the elbow occurs at K=3, it suggests that three clusters best represent the customers' purchasing behaviors.

    • Silhouette Score: Measures how close each point in one cluster is to the points in neighboring clusters. A higher score indicates well-separated and dense clusters.

    Example: If your data is split into clusters of high-value and low-value customers, a higher silhouette score would indicate that these groups are distinct and well-defined.

    • Gap Statistic: This method compares the clustering result with that of a random dataset. A large gap between the real and random clustering suggests a good clustering result.

    Example: If you're clustering images based on their similarity, a large gap would indicate a well-chosen K, as the real clusters differ greatly from random groupings.
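
    As referenced above, here is a short sketch (on a synthetic make_blobs dataset) that computes the SSE curve for the Elbow Method and the Silhouette Score over a range of K values:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    sse, sil = [], []
    k_values = range(2, 9)
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)                      # SSE for the elbow plot
        sil.append(silhouette_score(X, km.labels_))  # higher = better-separated clusters

    plt.plot(list(k_values), sse, marker='o')
    plt.xlabel("Number of clusters K")
    plt.ylabel("SSE (inertia)")
    plt.title("Elbow Method")
    plt.show()

    print("Silhouette scores by K:", dict(zip(k_values, [round(s, 3) for s in sil])))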

    Also Read: What is Overfitting & Underfitting In Machine Learning ? [Everything You Need to Learn] 

    2. Initial Centroid Selection 

    The initialization of centroids can greatly affect the clustering outcome. Poor initial centroids can lead to local minima, where the algorithm converges prematurely without finding the optimal clustering.

    • Random Initialization Problem: Randomly selecting initial centroids can sometimes place them too close to each other, causing poor clustering results and slower convergence.

    Example: If you’re clustering data for product recommendations and start with centroids close to each other, the algorithm might incorrectly group diverse products together, leading to inaccurate recommendations.

    • K-Means++: A more sophisticated method for selecting initial centroids. It spreads out the initial centroids, reducing the likelihood of poor initialization and improving the final results.

    Example: In a dataset of geographical locations, K-Means++ would ensure that the initial centroids are spread across the map, leading to more accurate clustering of regions with distinct characteristics.

    3. Outliers 

    Outliers can significantly distort the clustering results, as they pull centroids away from the true center of the data.

    • Impact of Outliers: Outliers affect the mean of the cluster, shifting the centroid and leading to poorly defined clusters.

    Example: If you’re clustering employees based on salary and experience, outliers like a few extremely high earners could skew the results and place them in the wrong cluster, causing inaccurate groupings of similar employees.

    • Handling Outliers: You can either remove outliers from the dataset or use clustering methods that are more resistant to them, like K-Medoids, which uses actual data points as centroids instead of the mean.

    Example: For customer segmentation, removing extreme outliers (such as a customer who makes an unusually large purchase once) would help in forming clusters that better represent typical customer behaviors.
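
    As a small numerical illustration of this point, the sketch below (hypothetical salary values) shows how a single extreme value drags the mean away from the bulk of the data, while a medoid, being an actual data point, stays representative:

    import numpy as np

    # Hypothetical salaries (in thousands); one extreme earner acts as an outlier
    salaries = np.array([40, 42, 45, 48, 50, 400])

    # The mean (used by K-Means centroids) is pulled toward the outlier
    print("Mean:", salaries.mean())  # ~104.2

    # The medoid (the point minimizing total distance to the others) stays near the bulk
    dist_sums = np.abs(salaries[:, None] - salaries[None, :]).sum(axis=1)
    print("Medoid:", salaries[dist_sums.argmin()])  # 45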

    Understanding and addressing these practical considerations in K-Means ensures that you can avoid common pitfalls and make the most of the algorithm's potential. 

    Also Read: Outlier Analysis in Data Mining: Techniques, Detection Methods, and Best Practices

    Next, let’s put these concepts into action and see how K-Means can help you efficiently cluster your data and gain meaningful insights.

    Practical Example: Implementing K-Means

    Implementing K-Means is especially beneficial because it is computationally efficient and works well with datasets where clusters are roughly spherical and well-separated. With this hands-on example, you'll learn how to apply K-Means to segment your data, find patterns, and generate actionable insights. 

    Step 1: Install Required Libraries

    First, ensure you have the required libraries installed. You can install them using pip if you don't have them yet: 

    pip install scikit-learn matplotlib numpy

    Step 2: Import Libraries

    Next, import the necessary libraries for data manipulation, clustering, and visualization. 

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    • numpy is for numerical operations.
    • matplotlib.pyplot is for plotting graphs.
    • KMeans is the clustering algorithm from scikit-learn.
    • make_blobs is used to generate synthetic data for clustering.

    Struggling with data manipulation and visualization? Check out upGrad’s free Learn Python Libraries: NumPy, Matplotlib & Pandas course. Gain the skills to handle complex datasets and create powerful visualizations. Start learning today!

    Step 3: Generate Synthetic Data

    For this example, we’ll create a simple synthetic dataset using make_blobs. This function generates clusters of points for us to cluster. 

    # Create a synthetic dataset with 2 features and 3 clusters
    X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    This generates 300 data points divided into 3 clusters.
    Step 4: Apply K-Means Clustering
    Now, let’s apply the K-Means algorithm to this data. We’ll set K=3 since we know the data has 3 clusters. 

    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(X)
    # Get the cluster centroids
    centroids = kmeans.cluster_centers_
    # Get the labels (cluster assignments for each data point)
    labels = kmeans.labels_

    Here:

    • n_clusters=3 specifies that we want to divide the data into 3 clusters.
    • fit(X) runs the K-Means algorithm on the data.
    • cluster_centers_ gives the coordinates of the centroids of the clusters.
    • labels_ contains the cluster assignment for each data point.

    Step 5: Visualize the Clusters

    Let’s plot the data points and their respective cluster centroids to visualize the result of the clustering. 

    # Plot the data points and the centroids
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.6)  # Data points
    plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X')  # Centroids
    plt.title("K-Means Clustering")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()
    • The data points are colored based on their cluster assignment using the c=labels argument.
    • The centroids are marked with red 'X' symbols.

    Output: A scatter plot of the 300 data points colored by cluster assignment, with the three centroids marked as red 'X' symbols.

    Step 6: Evaluate the Results

    We can use the inertia_ attribute of the fitted KMeans object to evaluate how well the clusters were formed. Inertia measures the total squared distance between the data points and their respective centroids. A lower inertia value indicates tighter clustering. 

    # Print the inertia (sum of squared distances to the closest centroid)
    print(f"Inertia: {kmeans.inertia_}")

    After implementing K-Means, the next step is to experiment with different values of K to find the optimal number of clusters using methods like the Elbow Method or Silhouette Score. 

    Once you've fine-tuned your clustering model, the next crucial step is Evaluating Clustering Performance to ensure the quality of your clusters.

    Also Read: Top 15 Types of Data Visualization: Benefits and How to Choose the Right Tool for Your Needs in 2025

    Evaluating Clustering Performance

    Evaluating the performance of your clustering model is essential to ensure the quality and validity of the results. Without proper evaluation, you risk drawing incorrect conclusions from poorly defined clusters. For example, if you're segmenting customers for targeted marketing, poorly defined clusters could lead to ineffective campaigns that miss the mark. 

    1. Internal Evaluation Metrics

    Internal evaluation metrics assess the quality of your clustering by looking at the structure and coherence of the clusters themselves. These metrics do not require any external labels, making them ideal for unsupervised learning.

    • Silhouette Score 

    The Silhouette Score measures how similar each point is to its own cluster compared to other clusters. A score close to +1 means the point is well-clustered, while a score close to -1 suggests the point might be incorrectly assigned. Used by Netflix to evaluate how well the user segments created for personalized recommendations are defined and distinct.

    Why it matters: It gives you a clear indication of how well the data points fit within their clusters. 

    How to calculate:

    from sklearn.metrics import silhouette_score
    score = silhouette_score(X, labels)
    print(f"Silhouette Score: {score}")
    • Davies-Bouldin Index 

    The Davies-Bouldin Index evaluates cluster quality by comparing the average distance between the clusters to their internal cohesion. A lower score indicates better clustering. Applied by Amazon to measure the quality of customer clusters in order to tailor targeted marketing campaigns.

    Why it matters: It balances both the compactness of clusters and their separation. 

    How to calculate:

    from sklearn.metrics import davies_bouldin_score
    db_score = davies_bouldin_score(X, labels)
    print(f"Davies-Bouldin Index: {db_score}")
    • Dunn Index 

    The Dunn Index identifies clusters that are well-separated and internally compact. A higher value indicates better clustering. Used by Spotify to assess the separation and cohesion of music genre clusters, ensuring better music recommendations based on user preferences. 

    Why it matters: It focuses on the distance between clusters relative to the size of the clusters. 

    How to calculate: This is less straightforward and typically requires custom implementation, but it’s useful for comparing different clustering configurations.

    D = min(Inter-cluster distance) / max(Intra-cluster distance)

    • Inter-cluster distance: The distance between the closest points from different clusters.
    • Intra-cluster distance: The maximum distance between points within the same cluster.

    A higher Dunn Index indicates better clustering, with well-separated and compact clusters.

    To calculate it, you need to compute pairwise distances between all clusters and their members. It’s not directly available in scikit-learn, but custom code can be written to calculate it. 

    Here's a rough idea of how you might implement it: 

    from sklearn.metrics import pairwise_distances
    import numpy as np

    def dunn_index(X, labels):
        # Pairwise distances between all points
        pairwise_dist = pairwise_distances(X)
        n_clusters = len(set(labels))

        min_intercluster_distance = np.inf
        max_intracluster_distance = -np.inf

        for i in range(n_clusters):
            mask_i = labels == i

            # Intra-cluster distance: largest distance between two points in cluster i
            max_intracluster_distance = max(
                max_intracluster_distance,
                np.max(pairwise_dist[mask_i][:, mask_i])
            )

            # Inter-cluster distance: smallest distance between cluster i and each later cluster j
            for j in range(i + 1, n_clusters):
                mask_j = labels == j
                inter_distance = np.min(pairwise_dist[mask_i][:, mask_j])
                min_intercluster_distance = min(min_intercluster_distance, inter_distance)

        # Dunn Index = min inter-cluster distance / max intra-cluster distance
        return min_intercluster_distance / max_intracluster_distance
    2. External Evaluation Metrics

    External evaluation metrics compare your clustering results against a known ground truth. These metrics are particularly useful when you have labeled data available, as they provide an objective measure of how well the clustering algorithm matched the expected results. 

    They help you validate your clustering performance and ensure the model's output is meaningful.

    Also Read: What are Sklearn Metrics and Why You Need to Know About Them? 

    • Adjusted Rand Index (ARI) 

    The ARI measures how similar your clustering is to a ground truth classification. It accounts for chance, making it a more reliable comparison. A score close to 1 indicates perfect agreement with the true labels. Used by Google News to compare clustering results of news articles with actual topics, helping improve content categorization.

    Why it matters: It lets you compare clustering results to known ground truth. 

    How to calculate:

    from sklearn.metrics import adjusted_rand_score
    ari = adjusted_rand_score(true_labels, labels)
    print(f"Adjusted Rand Index: {ari}")
    • Normalized Mutual Information (NMI) 

    NMI quantifies the amount of shared information between your clustering and the ground truth. A value of 1 means the clustering is identical to the true labels, while 0 means there is no information shared. Applied by Twitter to assess the similarity between user clusters based on their activity and engagement, helping improve ad targeting.

    Why it matters: It’s a good measure when you want to quantify the similarity between the predicted clusters and true labels. 

    How to calculate:

    from sklearn.metrics import normalized_mutual_info_score
    nmi = normalized_mutual_info_score(true_labels, labels)
    print(f"Normalized Mutual Information: {nmi}")
    3. Visualizing Clusters

    Visualizing clusters is an effective way to understand the results of clustering algorithms. By reducing the dimensions of your data (using methods like t-SNE or PCA), you can get a clearer, more intuitive sense of how well your data has been grouped. 

    Visualization helps to identify patterns, outliers, and potential improvements for the clustering model.

    • t-SNE (t-distributed Stochastic Neighbor Embedding) 

    t-SNE is a dimensionality reduction technique that helps visualize high-dimensional data by reducing it to two or three dimensions. It is particularly useful for visualizing clusters. Used by Amazon to visualize customer behavior clusters, helping improve product recommendations and marketing strategies.

    Why it matters: It helps you understand the spatial distribution of your clusters, especially in complex datasets. 

    How to visualize:

    from sklearn.manifold import TSNE
    tsne = TSNE(n_components=2)
    X_tsne = tsne.fit_transform(X)
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
    plt.title("t-SNE Visualization")
    plt.show()
    • PCA (Principal Component Analysis) 

    PCA is another technique for reducing the dimensionality of your data while preserving variance. It is often used to plot the data in 2D or 3D for easier visualization of clusters. Applied by Facebook to reduce the dimensions of user interaction data for efficient clustering and targeted content delivery.

    Why it matters: PCA helps identify the most important dimensions of your data and shows how clusters are distributed in these dimensions. 

    How to visualize:

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
    plt.title("PCA Visualization")
    plt.show()

    After evaluating the clustering performance, you can experiment with different clustering algorithms to see how they compare to K-Means. Try applying K-Medoids or Mini-Batch K-Means for larger datasets. 

    Also Read: Introduction to Classification Algorithm: Concepts & Various Types 

    Explore how variations like Gaussian Mixture Models (GMM) work in handling more complex cluster shapes. Let's dive deeper into these advanced variations and their applications.

    Advanced Variations and Extensions of K-Means Clustering

    Advanced variations and extensions of K-Means clustering address its limitations, making it more versatile and applicable to a wider range of data. These methods improve K-Means by enhancing its efficiency, scalability, and ability to handle more complex datasets. 

    For example, K-Medoids deals with outliers better, while Mini-Batch K-Means speeds up the algorithm for large datasets. 

    K-Medoids

    K-Medoids is a variation of K-Means clustering that uses actual data points as the centroids (medoids) of clusters, rather than calculating the mean of the data points. This method is more robust to outliers and is ideal for datasets with noisy or non-numeric data. 

    Unlike K-Means, K-Medoids minimizes the sum of dissimilarities between points and medoids, rather than the squared Euclidean distance.

    Here's a quick comparison between K-Means and K-Medoids to highlight the key differences:


    | Aspect | K-Means | K-Medoids |
    | --- | --- | --- |
    | Centroid Calculation | Uses the mean of data points | Uses actual data points (medoids) |
    | Sensitivity to Outliers | Sensitive to outliers | More robust to outliers |
    | Data Types | Primarily works with numerical data | Can work with any data type (e.g., categorical, numeric) |
    | Computational Efficiency | Generally faster for large datasets | Slower, especially with large datasets |
    | Cluster Shape | Assumes spherical clusters | Can handle non-spherical clusters |

    K-Medoids is particularly beneficial when dealing with datasets that include outliers or categorical data, where the mean might not represent the "center" of the data well.

    K-Medoids operates similarly to K-Means, but instead of using the mean of the data points in a cluster, it selects an actual data point as the cluster’s centroid (medoid). Here's a breakdown of how it works:

    1. Initialize K medoids: Randomly select K data points from the dataset as the initial medoids.
    2. Assign points to the nearest medoid: For each data point, calculate the distance to each medoid, and assign the point to the closest medoid.
    3. Update the medoids: For each cluster, calculate the data point that minimizes the total distance to all other points in the cluster, and set it as the new medoid.
    4. Repeat the assignment and update steps until convergence (no change in medoids).

    Here’s a simple Python implementation using the PAM (Partitioning Around Medoids) method for K-Medoids:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

    # Create synthetic dataset
    X, y = make_blobs(n_samples=300, centers=3, random_state=42)

    # K-Medoids Implementation (simple PAM-style iteration)
    def k_medoids(X, n_clusters):
        # Initialize random medoids (actual data points)
        medoids_idx = np.random.choice(len(X), n_clusters, replace=False)
        medoids = X[medoids_idx]
        while True:
            # Assign each point to the nearest medoid
            labels = pairwise_distances_argmin_min(X, medoids)[0]

            # Update medoids: the point with the smallest total distance
            # to all other points in its cluster becomes the new medoid
            new_medoids = np.copy(medoids)
            for i in range(n_clusters):
                cluster_points = X[labels == i]
                if len(cluster_points) == 0:
                    continue  # keep the old medoid if a cluster ends up empty
                total_distances = pairwise_distances(cluster_points).sum(axis=1)
                new_medoids[i] = cluster_points[np.argmin(total_distances)]

            # If no change in medoids, stop
            if np.allclose(medoids, new_medoids):
                break
            medoids = new_medoids

        return medoids, labels

    # Apply K-Medoids
    medoids, labels = k_medoids(X, 3)

    # Visualize the result
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.scatter(medoids[:, 0], medoids[:, 1], c='red', marker='X', s=200, label='Medoids')
    plt.title("K-Medoids Clustering")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()

    Output: A scatter plot of the three clusters in different colors, with the medoids marked as red 'X' symbols. 

    In this example:

    • We generate synthetic data using make_blobs.
    • The k_medoids function implements the K-Medoids algorithm, where the points are assigned to the nearest medoid, and medoids are updated iteratively.
    • The result is visualized with clusters represented by different colors, and the medoids marked with red 'X's.

    Also Read: Key Data Mining Functionalities with Examples for Better Analysis 

    While K-Medoids is an excellent alternative to K-Means, there are several other variations of centroid based clustering that address different challenges, particularly when working with large datasets or more complex data structures. 

    These variations provide enhanced flexibility and performance, depending on the nature of your data. Let’s explore a few of these variations:

    • Mini-Batch K-Means 

    This variant speeds up the standard K-Means algorithm by updating the centroids using small random batches of data instead of the entire dataset.

    Algorithm:

    1. Randomly initialize K centroids.
    2. Repeat until convergence:
      • Select a small random batch of data points.
      • Assign each point to the nearest centroid.
      • Update the centroids based on the selected batch.
    3. Stop when centroids stabilize or after a set number of iterations.

    How it works: Instead of using the whole dataset in each iteration, it uses a small subset (mini-batch) to update the centroids. This significantly reduces computation time for large datasets.
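
    A minimal sketch using scikit-learn's MiniBatchKMeans on a synthetic dataset (the sample size and batch_size values here are illustrative):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    # A larger synthetic dataset where mini-batch updates pay off
    X, _ = make_blobs(n_samples=10000, centers=3, random_state=42)

    # batch_size controls how many points update the centroids per iteration
    mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=42)
    labels = mbk.fit_predict(X)

    print("Centroids:\n", mbk.cluster_centers_)
    print("Inertia:", mbk.inertia_)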

    • Gaussian Mixture Models (GMM) 

    GMM is a probabilistic clustering technique that assumes data points are generated from a mixture of several Gaussian distributions.

    Algorithm: 

    1. Initialize the parameters of each Gaussian (mean, covariance, and mixture weight).
    2. Repeat the following until convergence:
      • E-Step: Compute the probability of each data point belonging to each Gaussian component.
      • M-Step: Update the parameters of the Gaussians (mean, covariance, and weight) based on the probabilities from the E-step.
    3. Stop when the parameters converge or after a set number of iterations.

    How it works: Unlike K-Means, which assigns points to a single cluster, GMM assigns probabilities to each data point for belonging to each cluster, allowing for "soft" clustering. It applies the Expectation-Maximization (EM) algorithm to iteratively estimate the parameters of the Gaussian distributions, refining the probability of data points belonging to each cluster to optimize the clustering results.
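
    A short sketch using scikit-learn's GaussianMixture to obtain both hard assignments and the "soft" membership probabilities described above (synthetic data assumed):

    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=0)

    # covariance_type='full' lets each Gaussian take its own elliptical shape
    gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
    gmm.fit(X)

    hard_labels = gmm.predict(X)       # most likely component for each point
    soft_probs = gmm.predict_proba(X)  # probability of each point belonging to each component

    print(soft_probs[:3].round(3))     # each row sums to 1 across the 3 components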

    • K-Means++ Initialization 

    K-Means++ is an improved initialization method that helps reduce the chances of poor clustering by spreading out the initial centroids more effectively.

    Algorithm:

    1. Randomly select the first centroid.
    2. For each data point, compute its distance to the nearest existing centroid.
    3. Choose the next centroid with probability proportional to the square of the distance to the nearest centroid.
    4. Repeat steps 2 and 3 until K centroids are selected. 
    5. Run the standard K-Means algorithm with the initialized centroids.

    How it works: It selects the first centroid randomly, then chooses subsequent centroids based on a probability distribution proportional to their distance from the already selected centroids, ensuring better starting points for the K-Means algorithm.
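
    In scikit-learn, this initialization is available through the init parameter of KMeans (and is the default); the sketch below contrasts it with plain random initialization on synthetic data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # init='k-means++' spreads out the initial centroids (scikit-learn's default);
    # init='random' with a single run shows the effect of plain random initialization
    km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
    km_rand = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(X)

    print("K-Means++ inertia:", km_pp.inertia_)
    print("Random init inertia:", km_rand.inertia_)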

    After experimenting with K-Means and its variations, try applying these methods to real-world datasets. Explore clustering with high-dimensional data and non-spherical shapes to see how each variation performs. Experiment with different initialization methods and clustering metrics to fine-tune your results. 

    Also Read: Top 10 Dimensionality Reduction Techniques for Machine Learning(ML) in 2025 

    Once you've gained hands-on experience, the next step is understanding where K-Means excels and its limitations.

    Advantages and Limitations of Centroid Based Clustering in Data Mining

    Understanding the advantages and limitations of centroid based clustering is critical for optimizing its use in data mining tasks. While this method is powerful and widely applicable, it's not a one-size-fits-all solution. By recognizing where it excels and where it falls short, you can make more informed decisions on when to apply this technique.

    The table below summarizes the key advantages and limitations of Centroid Based Clustering for quick reference.

    | Advantages | Limitations | Workaround |
    | --- | --- | --- |
    | Does not require labeled data, making it ideal for discovering patterns in unlabeled datasets. | Assumes clusters are spherical, which may not always be true. | Use DBSCAN or Spectral Clustering for non-spherical clusters. |
    | Efficient in grouping data with complex relationships. | Struggles with high-dimensional data, as distance becomes less meaningful. | Apply PCA to reduce dimensionality before clustering. |
    | Helps uncover hidden patterns and relationships in data. | Sensitive to outliers, which can distort clusters. | Use K-Medoids or robust clustering methods for better outlier handling. |
    | Can identify anomalies and outliers by nature. | Struggles to capture hierarchical clusters. | Use Hierarchical Clustering to handle nested clusters. |
    | Scalable for large datasets, making it suitable for big data applications. | Computationally intensive for very large datasets. | Use Mini-Batch K-Means to speed up clustering for large datasets. |

    Also Read: Machine Learning Projects with Source Code in 2025

    To take your clustering skills further, experiment with different initialization techniques like K-Means++ and try out clustering methods like DBSCAN for non-spherical data. You can also visualize high-dimensional data using PCA or t-SNE and apply Mini-Batch K-Means for faster clustering on large datasets. 

    Understanding how these methods are applied to fields like customer segmentation, anomaly detection, and recommendation systems will help you see the practical value of clustering in solving real-life problems.

    Real Life Applications of Clustering in Machine Learning

    Clustering techniques, especially centroid-based methods like K-Means, play a pivotal role in solving a wide range of real-world problems. For example, businesses use clustering to group customers based on purchasing behavior, which helps create targeted marketing campaigns. 

    This insight will make it easier to apply clustering in your own projects. Below is a table summarizing how it can be used in various real-life applications:

    | Application | Description |
    | --- | --- |
    | Biological Data Analysis | Clustering is used extensively in genomics, particularly for classifying gene expression data. For example, NASA uses clustering in bioinformatics to analyze gene patterns for disease research. It's crucial in identifying gene expression groups that are linked to various diseases. |
    | Geospatial Data Clustering | Uber and other ride-sharing companies use clustering to analyze geospatial data, grouping areas with high ride demand. This helps optimize pricing models and dispatch systems by identifying "hot spots" for rides in real time. |
    | Market Basket Analysis | Retailers like Amazon use clustering for market basket analysis, grouping products that are often bought together. This informs product placement strategies and personalized recommendations on e-commerce platforms. |
    | Image Compression | Color quantization for image compression relies on clustering techniques to reduce image file sizes. It groups similar pixel colors together, helping maintain image quality while minimizing storage. It's used in applications ranging from digital photography to online streaming services. |
    | Document Clustering | Google News uses clustering to group similar news articles, improving content recommendation. It analyzes text data from news sources and clusters similar topics, ensuring users receive relevant, grouped content. |

    After exploring clustering, you can dive into more advanced topics like Density-Based Clustering (e.g., DBSCAN) for handling noisy data, or Deep Learning for Clustering, such as Autoencoders for unsupervised feature learning. 

    You can also explore Dimensionality Reduction techniques like t-SNE and UMAP, which complement clustering by making high-dimensional data more manageable. These topics will help you build more sophisticated models for complex datasets.

    Now that you’ve gained insights into Centroid Based clustering, take your skills further with the Executive Programme in Generative AI for Leaders by upGrad. This program offers advanced training on clustering techniques and machine learning strategies, preparing you to drive innovation and apply it in complex data mining scenarios.

    Test Your Knowledge on Centroid Based Clustering!

    Assess your understanding of centroid based clustering, its key components, advantages, limitations, and real-world applications by answering the following multiple-choice questions.

    Test your knowledge now!

    Q1. What is the primary objective of centroid based clustering?
    A) To create hierarchical tree structures
    B) To partition data into groups based on similarity
    C) To maximize the variance of data within clusters
    D) To eliminate outliers from the dataset

    Q2. Which of the following is an example of a centroid based clustering algorithm?
    A) DBSCAN
    B) K-Means
    C) Agglomerative Clustering
    D) Hierarchical Clustering

    Q3. In K-Means clustering, what represents the center of a cluster?
    A) A random data point
    B) A centroid (mean) of the cluster
    C) The farthest data point
    D) The median of the cluster

    Q4. How does K-Means determine the final cluster centroids?
    A) By choosing the data point closest to the cluster’s edge
    B) By calculating the average of all data points within a cluster
    C) By selecting the centroid randomly
    D) By analyzing the data's variance

    Q5. What is a key limitation of K-Means clustering?
    A) It requires labeled data
    B) It assumes clusters are spherical
    C) It does not scale with large datasets
    D) It struggles with categorical data

    Q6. Which technique can be used to improve the initialization of centroids in K-Means?
    A) Mini-Batch K-Means
    B) K-Means++
    C) DBSCAN
    D) Gaussian Mixture Models

    Q7. How does Mini-Batch K-Means improve the standard K-Means algorithm?
    A) By processing smaller subsets of the data at a time
    B) By using only categorical data for clustering
    C) By performing hierarchical clustering on data
    D) By removing outliers before clustering

    Q8. When would you consider using K-Medoids over K-Means?
    A) When you have large, high-dimensional data
    B) When you need to avoid using actual data points as centroids
    C) When the data contains significant outliers
    D) When your data is perfectly spherical

    Q9. What is the primary advantage of Gaussian Mixture Models (GMM) over K-Means?
    A) GMM can handle overlapping clusters with probabilistic assignments
    B) GMM requires fewer data points for accurate clustering
    C) GMM automatically determines the optimal number of clusters
    D) GMM is faster in convergence compared to K-Means

    Q10. In which scenario would hierarchical clustering be more suitable than centroid based clustering?
    A) When the dataset has large, well-separated clusters
    B) When the dataset contains a high amount of noise
    C) When you need to visualize nested data structures
    D) When computational efficiency is the top priority

    You can also continue expanding your skills in unsupervised learning with upGrad, which will help you deepen your understanding of centroid based clustering in data mining and its real-life applications.

    Become an Expert at Clustering with upGrad!

    To gain proficiency in applying centroid based clustering techniques like K-Means and its variations, start by understanding the basics of unsupervised learning, clustering algorithms, and data preprocessing. Many learners face challenges when it comes to implementing these techniques in real-life scenarios.

    Trusted by data professionals, upGrad offers courses that teach you how to apply clustering to real-life data, helping you build efficient clustering systems for tasks like segmentation and anomaly detection.

    In addition to the courses mentioned, here are some more resources to help you further elevate your skills: 

    Not sure where to go next in your ML journey? upGrad's personalized career guidance can help you explore the right learning path based on your goals. You can also visit your nearest upGrad center and start hands-on training today!



    Frequently Asked Questions (FAQs)

    1. How does centroid-based clustering work in customer churn prediction for telecom companies?

    2. How can centroid-based clustering be applied in disease outbreak prediction?

    3. Is centroid based clustering suitable for text data?

    4. How do I handle categorical data with centroid based clustering?

    5. How do centroid based clustering algorithms compare with hierarchical clustering?

    6. Can centroid based clustering handle imbalanced clusters?

    7. What are the implications of poor centroid initialization in clustering?

    8. Can centroid based clustering be applied to image segmentation?

    9. How can you apply centroid based clustering to marketing campaigns?

    10. How does centroid based clustering handle changes in data over time?

    11. How can centroid based clustering be applied in social network analysis?
