ML: K-means, K-medoids

Suppose we have eight two-dimensional points. We can use K-means, an unsupervised clustering algorithm, to group the data points:
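A minimal sketch of this setup using scikit-learn's `KMeans`. The eight points below are illustrative values (the original does not list them); two loose groups are chosen so the result is easy to inspect.

```python
import numpy as np
from sklearn.cluster import KMeans

# Eight two-dimensional points (illustrative; chosen to form two rough groups)
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 2.5], [2.0, 1.5],   # first group
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5], [8.0, 9.5],   # second group
])

# Fit K-means with k=2; n_init restarts guard against bad initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster index assigned to each of the eight points
print(km.cluster_centers_)  # the two centroids (means of the assigned points)
```

Note that the centroids are means, so they generally are not members of the dataset.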

K-means clustering is sensitive to outliers, because each center is the mean of its cluster and a single extreme point can pull that mean away from the rest, but the algorithm scales well to large data sets.

Unlike K-means, K-medoids chooses actual data points as centers (medoids). This makes it less sensitive to outliers, but its higher computational cost makes it unsuitable for big data sets.
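A minimal K-medoids sketch in plain NumPy (a Voronoi-iteration variant, assuming Euclidean distance; the function name and data are hypothetical). The full pairwise distance matrix it builds is exactly why the method does not scale: memory and time grow quadratically with the number of points.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Minimal K-medoids sketch: centers are always actual data points."""
    rng = np.random.default_rng(seed)
    # Full pairwise distance matrix: O(n^2) cost, fine for small data only
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign each point to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            # New medoid: the member minimizing total distance to its cluster
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)       # final assignment
    return medoids, labels

X = np.array([[1., 1.], [1.5, 2.], [1., 2.5], [2., 1.5],
              [8., 8.], [8.5, 9.], [9., 8.5], [100., 100.]])  # last point: an outlier
medoids, labels = k_medoids(X, k=2)
print(X[medoids])  # centers are real data points; an outlier cannot drag a mean
```

Because the chosen center must itself be a data point, an extreme outlier cannot distort a cluster center the way it distorts a K-means centroid.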

Instead of relying on prior knowledge, the Elbow method can be used to determine the number of centers. It runs K-means on the dataset for a range of values of k, then computes an aggregated score for each value of k. The silhouette score, ranging from -1 to 1, measures how well each point fits its assigned cluster relative to neighboring clusters; the greater, the better. Another option is the distortion score, the sum of squared distances from each point to its assigned center. A third option, the Calinski-Harabasz score, is the ratio of the dispersion between clusters to the dispersion within clusters.

We then plot the computed scores against k and visually identify the 'elbow', the point of inflection on the curve, as the optimal value of k.
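The loop above can be sketched as follows, using scikit-learn's built-in metrics. The three synthetic blobs are an assumption for illustration; with them, the scores should clearly favor k=3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Illustrative data: three well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortion = km.inertia_                     # sum of squared distances to centers
    sil = silhouette_score(X, km.labels_)        # in [-1, 1], higher is better
    ch = calinski_harabasz_score(X, km.labels_)  # between/within dispersion ratio
    scores[k] = sil
    print(f"k={k}  distortion={distortion:.1f}  silhouette={sil:.3f}  CH={ch:.1f}")
# Plot distortion (or the other scores) against k, e.g. with matplotlib,
# and look for the elbow where the curve's slope flattens.
```

Note that distortion always decreases as k grows, which is why we look for the elbow rather than the minimum; the silhouette and Calinski-Harabasz scores, by contrast, peak near the best k.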