K-means Clustering

Created: 2019 Nov 5 3:14
Edited: 2025 May 20 20:48

Objective: minimize the sum of squared distances (Euclidean distance) between each point and its cluster centroid.

  1. Initialize K cluster centroids randomly
  2. Assign each example in the dataset to the cluster with the nearest centroid
  3. Update each cluster's centroid to the mean of its assigned points
  4. Repeat steps 2 and 3 until convergence
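The four steps above can be sketched in NumPy. This is a minimal illustration, not an optimized implementation; the function name and parameters are my own:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

For well-separated groups, the loop typically converges in a handful of iterations.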

Complexity

K is a hyper-parameter (the number of centroids) and N is the number of data points; each iteration costs O(NKd) for d-dimensional data.
  1. Start from K randomly chosen centroids μ
  2. Given μ, find the optimal assignment variables z by computing distances over the d dimensions
  3. Given z, find the optimal centroids μ (the per-cluster means)
  4. Repeat until convergence (guaranteed, but only to a local optimum)

Calculate the loss after convergence, for evaluation or for choosing K via the elbow method.
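The loss used for the elbow method can be sketched as below; the tiny two-blob dataset and fixed centroids are hypothetical, chosen so the effect is visible without running the full algorithm:

```python
import numpy as np

def kmeans_loss(X, centroids):
    """K-means objective: sum over points of the squared distance
    to the nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical dataset: two tight groups around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])

# The loss drops sharply from K=1 to K=2, then flattens: the "elbow" at K=2.
loss_k1 = kmeans_loss(X, np.array([[2.5, 2.5]]))
loss_k2 = kmeans_loss(X, np.array([[0.0, 0.0], [5.0, 5.0]]))
```

Plotting the converged loss against K and picking the bend in the curve is the elbow method.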

Pros

  • Relatively simple to implement
  • Scales to large datasets
  • Guaranteed to converge (to a local optimum)

Cons

  • K must be specified in advance
  • Results depend on the initial centroid values
  • Clusters form a Voronoi diagram of the centers
  • Cannot model cluster covariances well (clusters are effectively assumed spherical)

EM Algorithm

  • E Step: point reassignment - assign each point to its nearest centroid
  • M Step: centroid reassignment - recompute each centroid as the mean of its assigned points
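The two steps can be written as standalone functions, which makes the EM structure explicit. A sketch under my own names; the M step assumes every cluster keeps at least one point:

```python
import numpy as np

def e_step(X, centroids):
    """E step: reassign each point to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def m_step(X, labels, k):
    """M step: move each centroid to the mean of its assigned points.
    Assumes every cluster has at least one assigned point."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```

Alternating these two calls until the labels stop changing is exactly the K-means loop.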
Limitation

K-means is sensitive to the initial centroids and to outliers in the data.
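A common mitigation for the initialization sensitivity is k-means++ seeding, which spreads the initial centroids apart. A minimal sketch (the function name is my own, not from the source):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: pick the first centroid at random, then pick
    each next centroid with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        chosen = np.array(centroids)
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid proportional to that squared distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Running K-means from these seeds (or from several random restarts, keeping the lowest loss) usually avoids the worst local optima.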
K-means Substitutes
