1. K-Medoids Clustering Overview
K-Medoids is a partitioning-based clustering algorithm, similar to K-Means, but instead of using the mean to represent the centroid of each cluster, it uses an actual data point (medoid). A medoid is the data point in a cluster that minimizes the sum of the dissimilarities (distances) to all other points in the cluster.
Unlike K-Means, whose cluster centers can be pulled toward extreme values because they are computed as means, K-Medoids is more robust because its cluster centers are always actual data points.
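To make the medoid idea concrete, the short Python sketch below picks the medoid of a handful of made-up 2-D points by brute force (the points and the choice of Manhattan distance are purely illustrative, not taken from this article's dataset):

```python
import numpy as np

# A few illustrative 2-D points (not the customer data used later)
points = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [10, 10]])

# Pairwise Manhattan distances between all points
pairwise = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)

# The medoid is the point with the smallest total distance to all others
medoid = points[pairwise.sum(axis=1).argmin()]
print(medoid)  # -> [2 2]
```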
2. Application of K-Medoids Clustering
K-Medoids clustering is used in many applications, particularly where clusters are not naturally spherical (as K-Means assumes) or where the data contains outliers:
- Customer Segmentation: Identifying customer segments based on purchasing behaviors (similar to K-Means but more robust to outliers).
- Image Segmentation: Grouping similar pixels in image processing for object recognition or compression.
- Document Clustering: Grouping text documents based on similarity measures, which can handle non-Euclidean distances.
- Bioinformatics: Identifying clusters of similar biological data like gene expression or protein sequences.
- Anomaly Detection: Detecting unusual or outlying points in data, as K-Medoids is more resilient to outliers.
- Geospatial Clustering: Grouping geographic data points based on distance, like clustering locations of stores or facilities.
3. Advantages of K-Medoids
- Robust to Outliers: K-Medoids is more robust to outliers than K-Means because it uses medoids (real data points) instead of the mean to represent the center of a cluster.
- Works with Non-Euclidean Distances: K-Medoids can use other distance metrics (like Manhattan or Cosine distance), making it more flexible for different types of data (e.g., categorical or textual data).
- Handles Arbitrary Cluster Shapes: Unlike K-Means, which assumes spherical clusters, K-Medoids can handle arbitrary shapes as long as the medoid minimizes dissimilarity.
- Interpretable Results: The medoids themselves are actual data points, so the clusters are easier to interpret.
4. Disadvantages of K-Medoids
- Computational Complexity: K-Medoids is more computationally expensive than K-Means, especially for large datasets, because it involves pairwise distance calculations for all data points.
- Initialization Sensitive: Like K-Means, K-Medoids requires an initial choice of medoids, and the quality of the final clusters can depend on the initial choice.
- Not Scalable for Large Datasets: Due to its high computational complexity (especially with larger datasets), K-Medoids can become slow and inefficient as the dataset grows.
5. K-Medoids Algorithm Calculation Using Sample Data
Let’s use the same sample data for K-Medoids, where we have 6 customers, each with two features: annual income and spending score. We will perform clustering with K = 2 clusters.
Step 1: Sample Data
| Customer | Annual Income (k) | Spending Score (1-100) |
|---|---|---|
| C1 | 15 | 39 |
| C2 | 25 | 50 |
| C3 | 35 | 60 |
| C4 | 50 | 75 |
| C5 | 60 | 85 |
| C6 | 80 | 95 |
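If you want to follow the calculation in code, the table above can be captured as a small NumPy array (a sketch; the variable name customers is ours, not part of the article):

```python
import numpy as np

# Rows are C1..C6: [annual income (k), spending score]
customers = np.array([
    [15, 39],  # C1
    [25, 50],  # C2
    [35, 60],  # C3
    [50, 75],  # C4
    [60, 85],  # C5
    [80, 95],  # C6
])
```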
Step 2: Choose the Number of Clusters (K)
Let’s assume we want to divide the data into 2 clusters (K = 2).
Step 3: Initialize Medoids
Randomly choose two of the existing data points as the initial medoids:
- Medoid 1: C1 = (15, 39)
- Medoid 2: C4 = (50, 75)
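Continuing the sketch from Step 1, this initialization is just a pair of row indices into the customers array:

```python
# 0-based indices of C1 and C4 in the customers array
medoid_indices = [0, 3]
medoids = customers[medoid_indices]  # [[15, 39], [50, 75]]
```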
Step 4: Assign Points to the Nearest Medoid
Now, assign each data point to the medoid it is closest to. We’ll use the Manhattan distance (the sum of the absolute differences between coordinates) for simplicity.
The Manhattan distance between two points (x1, y1) and (x2, y2) is:
d = |x2 − x1| + |y2 − y1|
Let’s compute the Manhattan distance from each data point to the medoids:
For example, calculate the distance from Customer 1 (C1) to Medoid 1 and Medoid 2:
- Distance from C1 to Medoid 1:
d(C1, M1) = |15 − 15| + |39 − 39| = 0 + 0 = 0
- Distance from C1 to Medoid 2:
d(C1, M2) = |15 − 50| + |39 − 75| = 35 + 36 = 71
We repeat this calculation for all customers and assign each customer to the closest medoid.
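In code, this assignment step is a couple of NumPy operations over the customers and medoids arrays defined above (a sketch, not a library API):

```python
# Manhattan distance from every customer to every medoid
distances = np.abs(customers[:, None, :] - medoids[None, :, :]).sum(axis=2)

# Index of the nearest medoid for each customer
# (0 = cluster of medoid C1, 1 = cluster of medoid C4)
labels = distances.argmin(axis=1)
print(labels)  # -> [0 0 1 1 1 1] with the initial medoids above
```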
Step 5: Update Medoids
Once each point has been assigned to the nearest medoid, calculate the new medoids. A medoid is the data point that minimizes the sum of the dissimilarities (Manhattan distances) to all other points in the same cluster.
For example, if customers C1, C2, and C3 are assigned to Cluster 1, we would calculate the total Manhattan distance from each data point to all others in Cluster 1, and the point with the minimum distance will be the new medoid for that cluster.
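Here is a sketch of that update step in the same NumPy style, using the labels from Step 4; for each cluster it selects the member with the smallest total Manhattan distance to the other members:

```python
new_medoid_indices = []
for k in range(2):
    # Customers currently assigned to cluster k
    members = np.where(labels == k)[0]
    cluster = customers[members]
    # Total Manhattan distance from each member to every other member
    costs = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2).sum(axis=1)
    # The cheapest member becomes the new medoid of cluster k
    new_medoid_indices.append(int(members[costs.argmin()]))

medoids = customers[new_medoid_indices]
```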
Step 6: Repeat Until Convergence
Repeat steps 4 and 5 until the medoids no longer change, meaning the algorithm has converged.
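Putting the pieces together, the sketch below wraps Steps 4 and 5 in a loop that stops when the medoids stop changing. Note that this is the simple alternating scheme described above, not the full PAM swap procedure; for real work you would typically reach for a library implementation (for example, the KMedoids estimator in the scikit-learn-extra package) rather than hand-rolled code.

```python
def k_medoids(X, medoid_indices, max_iter=100):
    """Simple alternating K-Medoids with Manhattan distance."""
    medoid_indices = list(medoid_indices)
    for _ in range(max_iter):
        medoids = X[medoid_indices]
        # Step 4: assign every point to its nearest medoid
        labels = np.abs(X[:, None, :] - medoids[None, :, :]).sum(axis=2).argmin(axis=1)
        # Step 5: recompute the medoid of each cluster
        new_indices = []
        for k in range(len(medoid_indices)):
            members = np.where(labels == k)[0]
            cluster = X[members]
            costs = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2).sum(axis=1)
            new_indices.append(int(members[costs.argmin()]))
        # Step 6: stop once the medoids no longer change
        if new_indices == medoid_indices:
            break
        medoid_indices = new_indices
    return medoid_indices, labels

final_indices, labels = k_medoids(customers, [0, 3])
print(customers[final_indices], labels)
```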