Optimize minimize sum of similarity measure (Euclidean Distance)
- Initialize cluster centroids randomly
- Assign examples in dataset to clusters by
- Update centroid for each cluster using
- Repeat steps 2 and 3 until convergence
Complexity
K is hyper-parameter and K center, N number of data point
- Starting from randomly chosen K centroids
- Given μ, find optimal assignment variable z - calculate distance based on dimension
- Given z, find the optimal centroids (mean) μ
- Until Convergence (proved always)
Calculating Loss after convergence for evaluation or Elbow method
K-means Notion
Limitation
It is sensitive to initial centroids, data outlier
K-means Substitutes