Did you know? In situations with significant noise or outliers, K-Medoids can outperform K-Means in clustering accuracy. Studies have shown that for datasets containing just 10-15% outliers, the error rate of K-Means can be more than double that of K-Medoids in identifying the true underlying clusters.
Hierarchical clustering excels at uncovering the inherent structure within data by building a hierarchy of clusters, in contrast to partitioning methods like K-Means and K-Medoids. While K-Means defines each cluster center as the mean of the data points within the cluster, a measure that is sensitive to outliers, K-Medoids distinguishes itself by selecting actual data points as cluster centers, known as medoids. This makes K-Medoids inherently more robust when dealing with noisy data or outliers.
The choice between these algorithms hinges on the dataset's characteristics and the desired robustness to extreme values. This article explores the K-Medoids algorithm step by step, contrasts it with K-Means, and closes with practical guidance on when to use each.
Deepen your understanding of algorithms like K-Medoids and K-Means! Explore upGrad's comprehensive online AI and ML programs to master clustering techniques and build robust machine learning models for complex data analysis.
K-Medoids clustering is a partitioning clustering algorithm that divides a dataset into k distinct, non-overlapping clusters. Unlike K-Means, which uses the mean of the data points within a cluster (the centroid) as its representative center, K-Medoids selects actual data points as cluster centers.
Key Characteristics of K-Medoids: it uses actual data points (medoids) as cluster centers, it is comparatively robust to noise and outliers, and it works with any pairwise dissimilarity measure rather than requiring means.
Importance of the Distance Matrix: because K-Medoids needs only pairwise dissimilarities between points, not coordinates or averages, it can cluster any data for which a sensible distance (or dissimilarity) matrix can be computed.
Categorical Data: Dissimilarity can be measured using metrics like Hamming distance, the Jaccard index, or Gower distance for mixed numerical and categorical features.
The K-Medoids clustering algorithm typically follows four steps, detailed in the next section: initialization, assignment, swap-and-update, and termination.
Key Features of K-Medoids Clustering:
To further enhance your understanding of clustering techniques, including K-Medoids and its practical applications, explore these highly recommended programs in Artificial Intelligence and Machine Learning:
K-Medoids clustering offers a robust alternative to K-Means. Using actual data points as cluster representatives makes it less susceptible to outliers and applicable to diverse data types through a distance matrix.
Also Read: Clustering in Machine Learning: Learn About Different Techniques
In this section, you'll discover the step-by-step process that defines the K-Medoids algorithm. You'll learn how it initializes by selecting medoids, then iteratively assigns data points to these representatives and refines the medoid selection to minimize cluster dissimilarity. This detailed breakdown will clarify the algorithm's mechanics.
1. Initialization:
Randomly select k data points from the dataset as initial medoids. Let's call the current medoids M={m1,m2,...,mk}, where each mi is a data point.
While straightforward, this random initialization can have implications. Especially in datasets with noise or outliers, a randomly chosen medoid might be an extreme value, potentially leading to suboptimal initial clusters. This can affect the algorithm's convergence speed and the quality of the final clustering.
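To make this concrete, here is a minimal NumPy sketch of the initialization step. The array X and the helper init_medoids are illustrative names rather than library functions; the only requirement the algorithm imposes is that medoids are chosen from the actual rows of the data.

```python
import numpy as np

def init_medoids(X, k, seed=42):
    """Randomly pick k distinct rows of X to serve as the initial medoids."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(X), size=k, replace=False)  # indices of actual data points

# Tiny illustrative dataset: 10 two-dimensional points (one obvious outlier).
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9],
              [8, 9], [25, 30], [4, 4], [3, 3], [9, 8]], dtype=float)
initial_medoids = init_medoids(X, k=3)
print("Initial medoid indices:", initial_medoids)
```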
2. Assignment:
For each data point pi in the dataset (that is not a medoid), assign it to the cluster whose medoid mj∈M is closest to it based on a chosen distance metric d(pi,mj). This assignment aims to minimize the distance of each point to its cluster's representative.
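Continuing the sketch above (and assuming X and initial_medoids are still in scope), the assignment step reduces to a single pairwise-distance call. Euclidean distance is just one choice; any metric supported by SciPy's cdist could be substituted.

```python
from scipy.spatial.distance import cdist

def assign_points(X, medoid_idx, metric="euclidean"):
    """Assign every point to the cluster of its nearest medoid."""
    dist_to_medoids = cdist(X, X[medoid_idx], metric=metric)  # shape (n_points, k)
    return dist_to_medoids.argmin(axis=1)                     # cluster label per point

labels = assign_points(X, initial_medoids)
print("Cluster labels:", labels)
```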
3. Swap and Update (Iterative Refinement):
This is the core of the algorithm's iterative improvement. For each cluster Ci and for each data point h in Ci that is not currently the medoid mi, consider swapping mi with h. Calculate the total dissimilarity (sum of distances) of all points in Ci to this new potential medoid h.
If this total dissimilarity is lower than the total dissimilarity with the current medoid mi, then update cluster Ci's medoid to h. This process is repeated for all clusters and all non-medoid points within each cluster in each iteration. This step aims to minimize the total dissimilarity within each cluster.
The total dissimilarity for a cluster Ci with medoid mi is calculated as the sum of distances from every point in the cluster to that medoid:

Cost(Ci) = Σ p∈Ci d(p, mi)

The algorithm aims to minimize the total dissimilarity across all clusters:

Total Cost = Σ i=1..k Σ p∈Ci d(p, mi)
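The swap step and these cost formulas translate almost directly into code. The sketch below reuses X and labels from the earlier snippets and introduces two hypothetical helpers: one computes Cost(Ci) for a candidate medoid, the other tries every member of a cluster and keeps the cheapest.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_cost(X, member_idx, medoid):
    """Cost(Ci): sum of distances from every point in the cluster to the medoid."""
    return cdist(X[member_idx], medoid.reshape(1, -1)).sum()

def best_medoid_for_cluster(X, member_idx):
    """Try every member of the cluster as medoid; keep the lowest-cost one."""
    costs = [cluster_cost(X, member_idx, X[h]) for h in member_idx]
    return member_idx[int(np.argmin(costs))]

# Refine the medoid of cluster 0 from the earlier assignment step.
members_0 = np.where(labels == 0)[0]
print("Best medoid index for cluster 0:", best_medoid_for_cluster(X, members_0))
```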
Distance Matrix Illustration (Conceptual):
Imagine a table where both rows and columns represent your data points. Each cell (i,j) in this table contains the distance between data points i and j. The K-Medoids clustering algorithm uses these pairwise distances to determine the best medoids and assign points to clusters.
| Data Point | Point 1 | Point 2 | Point 3 | ... | Point N |
|------------|---------|---------|---------|-----|---------|
| Point 1 | 0 | d₁₂ | d₁₃ | ... | d₁N |
| Point 2 | d₂₁ | 0 | d₂₃ | ... | d₂N |
| Point 3 | d₃₁ | d₃₂ | 0 | ... | d₃N |
| ... | ... | ... | ... | 0 | ... |
| Point N | dN₁ | dN₂ | dN₃ | ... | 0 |
Here, dᵢⱼ represents the distance between data points i and j.
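In code, this matrix is one call away. The sketch below, reusing the small X array from the earlier snippets, builds the full N x N Euclidean distance matrix; swapping in 'cityblock' or another SciPy metric is a one-word change.

```python
from scipy.spatial.distance import pdist, squareform

# Full N x N distance matrix for the small X array used in the sketches above;
# the diagonal is zero, exactly as in the conceptual table.
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))
```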
4. Termination:
The algorithm stops when the medoids no longer change after an iteration or when a predefined stopping criterion (e.g., maximum number of iterations) is met. A common variant of the K-Medoids algorithm is PAM (Partitioning Around Medoids). PAM is a specific implementation that systematically tries all possible swaps of medoids with non-medoids to find the configuration that minimizes the total dissimilarity.
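If you prefer a library implementation, scikit-learn-extra's KMedoids estimator (used again in the worked example later in this article) exposes, to the best of my knowledge, a method parameter that selects the PAM swap strategy; treat the exact argument as an assumption to verify against your installed version.

```python
from sklearn_extra.cluster import KMedoids

# PAM-style exhaustive swapping; 'alternate' is the library's faster default.
pam = KMedoids(n_clusters=3, metric="euclidean", method="pam", random_state=42)
pam.fit(X)  # X: the small example array from the sketches above
print("Medoid indices:", pam.medoid_indices_)
print("Cluster labels:", pam.labels_)
```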
Now, let's consider the efficiency of this approach.
The computational complexity of the basic K-Medoids clustering algorithm, particularly the PAM implementation, is generally higher than that of K-Means. In each iteration, for each of the k clusters and for each non-medoid point within that cluster, the algorithm calculates the total dissimilarity that would result if that point became the new medoid.
If n is the number of data points and k is the number of clusters, a naive implementation of PAM has a time complexity of approximately O(k(n−k)²) per iteration. Since the number of iterations can also be significant, the overall complexity can become relatively high, especially for large datasets.
Impact on Usability:
The slower speed of K-Medoids clustering is a trade-off for its advantages in handling outliers and diverse data types. Understanding this complexity is essential when choosing the appropriate clustering algorithm for a given task and dataset size.
Master the power of unsupervised learning and clustering techniques like K-Medoids! upGrad's comprehensive program on Unsupervised Learning: Clustering, trusted by over 11,000 learners, offers 11 hours of in-depth knowledge and practical skills, including clustering algorithms, Google Analytics integration for cluster analysis, and data cleaning.
Also Read: Types of Regression in Machine Learning: 18 Advanced Models
K-Medoids offers robustness through its use of medoids, which are actual data points used as cluster centers. In contrast, K-Means clustering is a widely adopted and efficient unsupervised learning algorithm that partitions data into a pre-defined number (K) of clusters using centroids, the mean of data points within a cluster.
This fundamental distinction makes K-Medoids more resilient to the influence of outliers, as medoids are actual data points and are less affected by extreme values. K-Means, on the other hand, operates by iteratively assigning data points to the nearest centroid and then recalculating the centroids to minimize within-cluster variance.
In this section, you'll learn the K-Means algorithm's step-by-step process. We'll break down how it starts with initial guesses for cluster centers and then iteratively refines these centers by assigning data points and recalculating means until stable clusters are formed.
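As a minimal illustration of those steps, the scikit-learn snippet below runs K-Means on a numerical feature matrix X (for example, the toy array from the earlier sketches); the k-means++ seeding and the n_init restarts are standard scikit-learn options, not anything specific to this article.

```python
from sklearn.cluster import KMeans

# Seed with k-means++, then alternate assign-points / recompute-means until stable.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
km.fit(X)  # X: any numerical feature matrix
print("Centroids (means, not actual data points):\n", km.cluster_centers_)
print("Cluster labels:", km.labels_)
```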
Also Read: Everything You Need to Know About Binary Logistic Regression
Elevate your data analysis skills by mastering inferential statistics! Explore upGrad's comprehensive program on the Basics of Inferential Statistics, trusted by over 18,000 learners. Invest 15 hours in learning essential concepts like probability and statistical inference, crucial for effective data analytics.
While K-Means is a powerful and widely used algorithm, it's essential to be aware of its limitations in order to apply it effectively. Let's explore common challenges when using K-Means and understand why they matter in practice.
| Aspect | Description | Example |
|--------|-------------|---------|
| Sensitivity to Outliers | Centroids are based on the mean, so outliers can significantly distort cluster centers. K-Medoids is less affected because it uses medoids instead of means. | A single customer making a huge purchase skews the centroid of the spending cluster, misrepresenting typical spending behavior. |
| Impact of Initial Centroid Selection | Random starting points can lead to different clustering results. Poor initialization might lead to suboptimal clustering. | Placing the starting centroids in different areas of the data can lead to drastically different groupings. Re-running the algorithm can help find better groupings. |
| Assumption of Spherical Clusters | K-Means assumes clusters are spherical and similar in size, using Euclidean distance as its measure. It struggles with irregularly shaped or unequal-sized clusters. | Clustering customer browsing behavior might produce varied, non-spherical clusters, which K-Means fails to capture accurately. |
| Reliance on Euclidean Distance | K-Means uses Euclidean distance, which might not be suitable for all data types, especially high-dimensional or text-based data. | Clustering documents by topic might require cosine similarity, which captures semantic similarity better than straight-line distance. |
| Convergence to Local Minima | K-Means might settle in a local minimum rather than the optimal clustering. Multiple runs can yield different results, and within-cluster variance (WCSS) can help assess clustering quality. | Running K-Means multiple times on the same dataset may result in different clusters; not all are equally effective in minimizing intra-cluster distances. |
By understanding these limitations, you can make more informed decisions about when to use K-Means and how to interpret its results. You can also recognize scenarios where an algorithm like K-Medoids Clustering might be more suitable.
Also Read: Guide to Decision Tree Algorithm: Applications, Pros & Cons
Here are some solutions to the limitations of K-Means clustering:
1. Sensitivity to Outliers: Remove or cap extreme values before clustering, or switch to a medoid-based method such as K-Medoids, whose centers are actual data points (see the sketch after this list).
2. Impact of Initial Centroid Selection: Use smarter seeding such as k-means++ and run the algorithm several times with different initializations, keeping the solution with the lowest within-cluster variance.
3. Assumption of Spherical Clusters and Euclidean Distance: For irregularly shaped or unequal-sized clusters, consider alternatives such as Gaussian Mixture Models, DBSCAN, or spectral clustering.
4. Reliance on Euclidean Distance for Similarity: Choose a measure that suits the data, for example cosine similarity for text or Gower distance for mixed types, and pair it with an algorithm like K-Medoids that accepts arbitrary distance metrics.
5. Potential for Convergence to Local Minima: Run the algorithm with multiple random restarts and keep the result with the lowest total within-cluster sum of squares (WCSS).
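As a small illustration of points 1, 2, and 5, the sketch below (assuming X is a NumPy feature array) combines robust scaling with k-means++ seeding and multiple restarts:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

# Median/IQR scaling dampens the pull of outliers; k-means++ seeding plus
# multiple restarts (n_init) reduces the risk of a poor local minimum.
X_scaled = RobustScaler().fit_transform(X)   # X: any numerical feature matrix
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_scaled)
print("Best WCSS across restarts:", round(km.inertia_, 2))
```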
In this section, you'll gain a clear understanding of the fundamental distinctions between K-Means and K-Medoids. We'll dissect their core mechanisms, particularly how they represent cluster centers, and analyze how these differences impact their sensitivity to outliers, use of distance metrics, computational demands, and the interpretability of their results.
This structured comparison will illuminate when to favor one algorithm over another in your clustering tasks.
| Feature | K-Means | K-Medoids |
|---------|---------|-----------|
| Type of Center | Centroid (mean of cluster points) | Medoid (actual data point within the cluster) |
| Sensitivity to Outliers | High | Low (more robust) |
| Distance Metric Usage | Typically Euclidean distance | Flexible; any distance metric can be used |
| Computational Complexity | Generally O(nkI), efficient for large n | Generally higher, around O(k(n−k)²I) for PAM |
| Interpretability and Stability | Lower interpretability of centers; less stable due to sensitivity to initialization and outliers | Higher interpretability of centers; more stable against outliers |
Also Read: Neural Network Architecture: Types, Components & Key Algorithms
Now that you understand the core differences between K-Means and K-Medoids, let's explore their typical use cases. The choice between these algorithms often hinges on the specific characteristics of your data and the goals of your analysis.
K-Means: This algorithm shines when dealing with large, relatively clean datasets with primarily numerical features. Its efficiency makes it well-suited for tasks like compressing millions of images by clustering pixel values, large-scale customer segmentation, and clustering web pages by keyword frequency when speed is critical.
Also Read: Segmentation in Marketing: Get Started with Effective Strategies
K-Medoids: On the other hand, K-Medoids clustering proves particularly valuable when your dataset contains mixed data types (provided a suitable distance metric is defined) or is prone to anomalies and outliers. Applications include clustering small to medium-sized datasets such as patient records with potential data entry errors, and any analysis where the cluster representative must be an actual, interpretable data point.
The selection ultimately depends on understanding your data's underlying distribution and your tolerance for noise.
If you anticipate significant outliers or require representative data points as your cluster centers, K-Medoids is often the more reliable choice, despite its higher computational cost. Conversely, K-Means remains a powerful and efficient tool for large, well-behaved numerical datasets where speed is critical.
Transform your e-commerce strategies by mastering data science! Join over 22,000 learners in upGrad's Data Science in E-commerce program, offering 13 hours of focused learning. Develop crucial skills in data analysis, A/B testing, machine learning, and more to drive sales and optimize performance.
Also Read: K Means Clustering in R: Step-by-Step Tutorial with Example
To better understand K-Medoids clustering, let's work through a practical example in Python. We'll implement the algorithm using the scikit-learn-extra library (imported as sklearn_extra), which provides a K-Medoids implementation compatible with scikit-learn's API. We'll use the classic Iris dataset and introduce some artificial noise to highlight K-Medoids' robustness compared to K-Means.
Code Example:
```python
# Install sklearn_extra if not already installed
# !pip install scikit-learn-extra
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# --- Step 1: Load Iris dataset and use two features for easy 2D visualization ---
iris = load_iris()
X = iris.data[:, :2] # Using only sepal length and width
y = iris.target
# --- Step 2: Add artificial outliers (noise) to simulate real-world data ---
np.random.seed(42)
n_outliers = 10
outliers = np.random.uniform(low=0, high=8, size=(n_outliers, 2))
# Combine data
X_noisy = np.vstack((X, outliers))
y_noisy = np.hstack((y, -np.ones(n_outliers))) # Label outliers as -1
# --- Step 3: Standardize the features ---
scaler = StandardScaler()
X_noisy_scaled = scaler.fit_transform(X_noisy)
# --- Step 4: Apply K-Medoids ---
n_clusters = 3
kmedoids = KMedoids(n_clusters=n_clusters, random_state=42)
kmedoids.fit(X_noisy_scaled)
kmedoids_labels = kmedoids.labels_
kmedoids_medoid_indices = kmedoids.medoid_indices_
kmedoids_medoids = X_noisy_scaled[kmedoids_medoid_indices]
kmedoids_centers = kmedoids.cluster_centers_
# --- Step 5: Apply K-Means for comparison ---
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans.fit(X_noisy_scaled)
kmeans_labels = kmeans.labels_
kmeans_centers = kmeans.cluster_centers_
# --- Step 6: Visualize both clustering results ---
plt.figure(figsize=(12, 5))
# K-Medoids Visualization
plt.subplot(1, 2, 1)
plt.scatter(X_noisy_scaled[:, 0], X_noisy_scaled[:, 1], c=kmedoids_labels, cmap='viridis', s=50)
plt.scatter(kmedoids_medoids[:, 0], kmedoids_medoids[:, 1], c='red', marker='X', s=200, label='Medoids')
plt.title('K-Medoids Clustering on Noisy Iris Data')
plt.xlabel('Sepal Length (Standardized)')
plt.ylabel('Sepal Width (Standardized)')
plt.legend()
# K-Means Visualization
plt.subplot(1, 2, 2)
plt.scatter(X_noisy_scaled[:, 0], X_noisy_scaled[:, 1], c=kmeans_labels, cmap='viridis', s=50)
plt.scatter(kmeans_centers[:, 0], kmeans_centers[:, 1], c='red', marker='D', s=200, label='Centroids')
plt.title('K-Means Clustering on Noisy Iris Data')
plt.xlabel('Sepal Length (Standardized)')
plt.ylabel('Sepal Width (Standardized)')
plt.legend()
plt.tight_layout()
plt.show()
# --- Step 7: Print medoid and centroid locations ---
print("\nK-Medoids Medoid Indices:", kmedoids_medoid_indices)
print("K-Medoids Medoid Locations (Standardized):\n", kmedoids_medoids)
print("\nK-Means Centroid Locations (Standardized):\n", kmeans_centers)
Output:

```
K-Medoids Medoid Indices: [ 35 98 143]
K-Medoids Medoid Locations (Standardized):
[[-0.0527 -0.6041]
[ 0.8553 0.9916]
[-1.4581 -1.1852]]
K-Means Centroid Locations (Standardized):
[[-0.0956 -0.6174]
[ 1.0876 1.1359]
[-1.3771 -1.1167]]
```
Explanation: We load two features of the Iris dataset (sepal length and width), inject ten random outliers labeled -1, standardize the features, and fit both K-Medoids and K-Means with three clusters on the same noisy data.
Visualization Explanation: In the left plot, the red X markers are the K-Medoids medoids, actual data points sitting inside the dense regions of each cluster. In the right plot, the red diamonds are the K-Means centroids, computed means that are pulled toward the injected outliers.
Output: We print the indices and locations of the K-Medoids medoids and the locations of the K-Means centroids.
In summary: The output and visualization support the idea that K-Medoids is more robust to outliers. Its cluster centers (medoids) remain representative of the bulk of the data, while K-Means' centroids are pulled away by the extreme values. This highlights a key practical difference between K-Means and K-Medoids.
While K-Medoids offers distinct advantages, it's essential to understand when it is the most appropriate choice compared to other clustering techniques. Let's delve into specific scenarios where K-Medoids shines.
Selecting the most appropriate clustering algorithm is critical in any unsupervised learning task. While both K-Means and K-Medoids aim to partition data into k clusters, their underlying mechanisms and sensitivities make them suitable for different scenarios.
Here are some decision pointers to guide you in choosing between these two powerful techniques, keeping in mind the differences between K-Means and K-Medoids outlined above.
| Scenario | Preferred Algorithm(s) | Key Considerations |
|----------|------------------------|--------------------|
| Large Datasets | K-Means | Computational efficiency is paramount. |
| Relatively Clean, Continuous Data | K-Means | The data is primarily numerical, with few outliers, and clusters are expected to be roughly spherical and of similar size. |
| Speed and Scalability are Primary Concerns | K-Means | Real-time processing or handling huge volumes of data requires a fast and scalable algorithm. |
| Small to Medium-Sized Datasets | K-Medoids | Computational cost is less of a constraint, and robustness to outliers is desired. |
| Noisy Data with Outliers | K-Medoids | The dataset contains significant outliers that could distort cluster centers if a mean-based approach is used. |
| Mixed Data Types or Non-Euclidean Distances | K-Medoids | The data includes categorical features, or the relationships between data points are better captured by distance metrics other than Euclidean (which K-Medoids can accommodate). |
| Interpretability of Cluster Centers is Key | K-Medoids | The cluster representatives need to be actual data points for easier understanding and actionability. |
| Potentially Non-Spherical or Unequal-Sized Clusters | K-Medoids (with an appropriate distance metric) or other algorithms (DBSCAN, GMM, Spectral Clustering) | While K-Medoids has some flexibility, other algorithms might be better suited for highly complex cluster shapes. However, K-Medoids with a tailored distance metric can outperform K-Means in some non-spherical scenarios. |
The Importance of Business Context and Data Characteristics:
Ultimately, the choice between K-Means and K-Medoids (or any other clustering algorithm) should be driven by a deep understanding of your business context and the specific characteristics of your data.
Harness the power of well-structured data with upGrad's Introduction to Database Design with MySQL course, trusted by over 6,000 learners. In just 8 hours, gain essential skills in database design, data analysis, warehousing, and ETL processes.
1. Which of the following best describes the center of a cluster in K-Medoids clustering?
a) The mean of all data points in the cluster.
b) A randomly selected data point from the dataset.
c) The data point within the cluster that minimizes the sum of dissimilarities to other points in the same cluster.
d) The data point closest to the centroid calculated by K-Means.
2. How does K-Medoids handle outliers compared to K-Means?
a) K-Means is more robust to outliers.
b) K-Medoids is more sensitive to outliers.
c) K-Medoids is generally less sensitive to outliers.
d) Both algorithms are equally affected by outliers.
3. Which distance metric is exclusively used by the standard K-Medoids algorithm?
a) Euclidean distance.
b) Manhattan distance.
c) Cosine similarity.
d) K-Medoids is flexible and can use various distance metrics.
4. What is the primary goal of the iterative "swap" step in the PAM (Partitioning Around Medoids) algorithm, a common implementation of K-Medoids?
a) To randomly reassign data points to different clusters.
b) To calculate the mean of the data points in each cluster.
c) To find a new set of medoids that minimizes the total dissimilarity within clusters.
d) To increase the distance between cluster centers.
5. What is a potential drawback of K-Medoids compared to K-Means, especially for very large datasets?
a) K-Medoids is more likely to converge to a local minimum.
b) K-Medoids typically has a higher computational complexity.
c) K-Medoids can only handle numerical data.
d) K-Medoids requires the number of clusters to be unknown.
6. In which of the following scenarios would K-Medoids likely be preferred over K-Means?
a) Clustering millions of images based on pixel values for compression.
b) Clustering a small dataset of patient records with potential data entry errors.
c) Clustering web pages based on keyword frequency where speed is critical.
d) Clustering sensor data known to be clean and forms perfect spherical groups.
7. What is the role of the "medoid" in K-Medoids clustering?
a) It represents the average location of all data points.
b) It is an initial guess for the cluster center refined iteratively.
c) It is an actual data point that serves as the center of a cluster.
d) It is a boundary point that separates different clusters.
8. Which of the following is NOT a typical step in the K-Medoids algorithm?
a) Randomly selecting initial medoids.
b) Assigning each data point to the cluster with the closest medoid.
c) Recalculating cluster centers as the mean of the assigned points.
d) Iteratively swapping medoids with non-medoid points to improve clustering.
9. How does the interpretability of cluster centers differ between K-Means and K-Medoids?
a) K-Means centers are always actual data points, making them more interpretable.
b) K-Medoids centers are always actual data points, making them more interpretable.
c) Both algorithms produce equally interpretable cluster centers.
d) Neither algorithm produces easily interpretable cluster centers.
10. If you have a dataset with a mix of numerical and categorical features and want to perform clustering, which algorithm might be more suitable if you can define a dissimilarity measure for the categorical features?
a) K-Means (directly).
b) K-Means after one-hot encoding all categorical features.
c) K-Medoids.
d) Neither K-Means nor K-Medoids can handle mixed data types.
Also Read: The Role of Machine Learning and Data Visualization in AI
So, you've now understood the intricacies of K-Medoids clustering, its algorithm, its advantages over K-Means, and its practical applications. With this knowledge, you can confidently approach datasets with outliers or non-standard distance metrics, making more informed decisions about which clustering technique best suits your unique analytical challenges.
Ready to apply your K-Medoids knowledge? Consider these upGrad programs to further your data science journey:
Feeling lost in the sea of data science options or unsure which path aligns with your career aspirations? Speak to our counselors or visit our learning centers for personalized guidance and clarity on the right courses to bridge your skill gaps and achieve your goals.
K-Means can sometimes suffer from the "curse of dimensionality," where distances become less meaningful. While also affected, K-Medoids might offer an advantage if a robust non-Euclidean distance metric suitable for high dimensions is used, and the number of data points isn't huge.
Yes, unlike K-Means, which relies on means, K-Medoids can directly handle categorical data. This is achieved by employing a suitable dissimilarity measure, such as the Gower distance, to calculate differences between data points. This method allows for clustering based on categorical attributes without requiring a conversion to numerical representations, offering a more natural approach for such data types.
Determining the optimal k for K-Medoids involves using evaluation techniques similar to those used for K-Means. Methods like the elbow method, silhouette analysis, and the gap statistic can be applied. These techniques assess the quality of the clustering for various values of k, helping to identify the point where adding more clusters provides diminishing returns or the silhouette score is maximized, indicating a good separation between clusters.
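A minimal sketch of that search, assuming scikit-learn-extra is installed and using the Iris data purely as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X = load_iris().data

# Try several values of k and keep the one with the highest silhouette score.
for k in range(2, 7):
    labels = KMedoids(n_clusters=k, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```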
Yes, the basic K-Medoids algorithm, particularly PAM, can be computationally intensive for large datasets. More scalable variants like CLARA (Clustering Large Applications) and CLARANS (Clustering Large Applications based upon RANdomized Search) have been developed. CLARA works by sampling the dataset, applying PAM to the samples, and then finding the best medoids. CLARANS uses a randomized search to explore potential medoid sets more efficiently than an exhaustive search.
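To my knowledge, scikit-learn-extra does not ship CLARA directly, but the idea is easy to sketch: fit K-Medoids on a few random samples and keep the medoid set that gives the lowest total dissimilarity on the full dataset. The sample size and loop count below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn_extra.cluster import KMedoids

X = load_iris().data
rng = np.random.default_rng(0)

best_cost, best_medoids = np.inf, None
for _ in range(5):                                  # number of random samples to try
    sample = rng.choice(len(X), size=60, replace=False)
    km = KMedoids(n_clusters=3, random_state=0).fit(X[sample])
    medoids = X[sample][km.medoid_indices_]
    cost = cdist(X, medoids).min(axis=1).sum()      # total dissimilarity on full data
    if cost < best_cost:
        best_cost, best_medoids = cost, medoids

print("Best total dissimilarity:", round(best_cost, 2))
print("Chosen medoids:\n", best_medoids)
```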
K-Medoids, like K-Means, operates on the principle of finding relatively compact clusters around the medoids and can struggle when faced with clusters of significantly different densities or irregular shapes. The algorithm's objective function doesn't explicitly account for density variations. In such cases, density-based clustering algorithms like DBSCAN, which group points based on their local density, might be more effective at identifying clusters with varying characteristics.
The initial choice of medoids in K-Medoids can influence the final clustering outcome, potentially leading to suboptimal solutions if the initial medoids are poorly chosen. To mitigate this sensitivity, it is common practice to run the K-Medoids algorithm multiple times, each with a different random initialization of the medoids. The final clustering result is then typically selected based on the run that yields the lowest total dissimilarity or the highest silhouette score, aiming for a more robust and representative clustering.
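A compact way to do this with scikit-learn-extra, assuming its KMedoids estimator exposes an inertia_ attribute (the total distance of samples to their closest medoid), is to loop over seeds and keep the best model:

```python
from sklearn.datasets import load_iris
from sklearn_extra.cluster import KMedoids

X = load_iris().data

# Run with several random seeds and keep the lowest-cost (lowest inertia_) model.
best = min(
    (KMedoids(n_clusters=3, random_state=seed).fit(X) for seed in range(10)),
    key=lambda model: model.inertia_,
)
print("Lowest total dissimilarity:", round(best.inertia_, 2))
print("Medoid indices of the best run:", best.medoid_indices_)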
The standard K-Medoids algorithm generally assumes complete data and does not inherently handle missing values. The calculation of pairwise dissimilarities, which is central to the algorithm, usually requires all attribute values to be present. Therefore, before applying K-Medoids to datasets with missing values, preprocessing steps such as imputation (replacing missing values with estimated ones) or employing distance metrics that can naturally handle missing data are necessary to ensure the algorithm can function correctly.
K-Medoids can cluster mixed data using distance metrics like Gower distance. This metric calculates dissimilarity by considering the nature of each variable. For numerical features, it uses normalized ranges, and for categorical, it uses mismatch. This allows for a unified dissimilarity measure across different data types, enabling effective clustering.
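As a sketch, the third-party gower package (an assumption: install it with pip install gower) can produce the dissimilarity matrix, which K-Medoids then consumes as a precomputed metric:

```python
import pandas as pd
import gower                                   # assumed third-party package: pip install gower
from sklearn_extra.cluster import KMedoids

# Small mixed-type dataset: one numerical and one categorical column.
df = pd.DataFrame({
    "income": [25_000, 27_000, 90_000, 95_000, 30_000, 88_000],
    "segment": ["retail", "retail", "corporate", "corporate", "retail", "corporate"],
})

D = gower.gower_matrix(df)                     # pairwise Gower dissimilarities
km = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
print("Cluster labels:", km.labels_)
print("Medoid rows:\n", df.iloc[km.medoid_indices_])
```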
K-Medoids is preferable when you have an idea of the number of clusters (k) and expect relatively compact groups. Unlike DBSCAN, it always produces k clusters. Compared to hierarchical clustering, it's more efficient for large datasets when a specific number of clusters is needed. Its robustness to outliers is also a key advantage over algorithms like K-Means.
The choice of distance metric is crucial as it defines how similarity is measured. For numerical data, Euclidean or Manhattan distances might be used. For categorical data, Hamming or Jaccard indices are suitable. For mixed data, Gower distance can be employed. Selecting a metric that aligns with the data's properties and the relationships you want to uncover is essential for meaningful clustering results.
Metrics like the silhouette score, Davies-Bouldin index, or the total within-cluster dissimilarity can be used to assess the quality of the K-Medoids clustering and compare different clustering results.
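For example, with scikit-learn's metrics module and a fitted KMedoids model (Iris used here only as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn_extra.cluster import KMedoids

X = load_iris().data
model = KMedoids(n_clusters=3, random_state=42).fit(X)

print("Silhouette score:", round(silhouette_score(X, model.labels_), 3))
print("Davies-Bouldin index:", round(davies_bouldin_score(X, model.labels_), 3))
print("Total within-cluster dissimilarity:", round(model.inertia_, 2))
```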