K-means

Created: 2019 Nov 5 3:14
Edited: 2025 Mar 11 11:09

Objective: minimize the sum of a dissimilarity measure (squared Euclidean distance) between each point and its cluster centroid.

  1. Initialize cluster centroids $\mu_1 \dots \mu_k \in \mathbb{R}^m$ randomly
  2. Assign examples in dataset $S$ to clusters $c_1 \dots c_k$ by $c(x_i) = \arg\min_j \|x_i - \mu_j\|^2$
  3. Update the centroid $\mu_j$ of each cluster $c_j$ using $\mu_j := \frac{\sum_{x_i \in c_j} x_i}{|c_j|}$
  4. Repeat steps 2 and 3 until convergence
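The steps above can be sketched in NumPy; this is a minimal illustration (function name, iteration cap, and seeding are my choices, not from the note):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by sampling k distinct points from X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this recovers the two groups after a handful of iterations.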

Complexity

K is a hyper-parameter (the number of centroids), N is the number of data points, and D is the dimension.
  1. Start from K randomly chosen centroids
  2. Given $\mu$, find the optimal assignment variables $z$: $O(KND)$ (distance computation scales with the dimension $D$)
  3. Given $z$, find the optimal centroids (means) $\mu$: $O(ND)$
  4. Repeat until convergence (guaranteed, to a local optimum)

Calculate the loss after convergence, for evaluation or for the Elbow method:

$L(\mu, X) = \min \sum_{i=1}^{n} \|x_i - \mu_{c(x_i)}\|^2$
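The converged loss is just the sum of squared distances from each point to its assigned centroid; a small sketch (the helper name `kmeans_loss` is mine):

```python
import numpy as np

def kmeans_loss(X, centroids, labels):
    """L(mu, X): sum of squared distances to each point's assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# Toy example: two tight pairs, each 0.5 away from its centroid
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1])
print(kmeans_loss(X, centroids, labels))  # 4 * 0.25 = 1.0
```

For the Elbow method, compute this loss for several values of K and look for the K after which the loss stops dropping sharply.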

Pros

  - Relatively simple to implement
  - Scales to large datasets
  - Guaranteed convergence (to a local optimum)

Cons

  - K must be specified in advance
  - Sensitive to initial centroid values
  - Clusters form a Voronoi diagram of the centers
  - Cannot model covariances well

Limitation

It is sensitive to the initial centroids and to outliers in the data.
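A common mitigation for initialization sensitivity is to run K-means several times from different random starts and keep the run with the lowest loss; a self-contained sketch (the function names and restart count are my choices):

```python
import numpy as np

def one_kmeans_run(X, k, rng, n_iters=50):
    """A single Lloyd's-algorithm run from one random initialization."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        centroids = np.array([X[labels == j].mean(0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    # Recompute assignments so the reported loss matches the final centroids
    labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    loss = float(((X - centroids[labels]) ** 2).sum())
    return loss, centroids, labels

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times and keep the lowest-loss solution."""
    rng = np.random.default_rng(seed)
    return min((one_kmeans_run(X, k, rng) for _ in range(n_restarts)),
               key=lambda run: run[0])
```

Restarts reduce the chance of a bad local optimum but do not address outlier sensitivity, which comes from the mean itself.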
