K-means

Created: 2019 Nov 5 3:14
Edited: 2025 Mar 11 11:09

Objective: minimize the sum of a dissimilarity measure (squared Euclidean distance) between each point and its cluster centroid.

  1. Initialize cluster centroids $\mu_1 \dots \mu_k \in \mathbb{R}^m$ randomly
  2. Assign examples in dataset $S$ to clusters $c_1 \dots c_k$ by $c(x_i) = \arg\min_j \|x_i - \mu_j\|^2$
  3. Update the centroid $\mu_j$ of each cluster $c_j$ using $\mu_j := \frac{\sum_{x_i \in c_j} x_i}{|c_j|}$
  4. Repeat steps 2 and 3 until convergence
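The steps above can be sketched in NumPy; this is a minimal illustration (function name, iteration cap, and seeding are my choices, not from the note):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by sampling k distinct points from X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this recovers the two groups after a handful of iterations.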

Complexity

K is a hyper-parameter (the number of centroids), N is the number of data points, and D is the dimension.
  1. Start from K randomly chosen centroids
  2. Given $\mu$, find the optimal assignment variables $z$: $O(KND)$ (distance computation scales with the dimension $D$)
  3. Given $z$, find the optimal centroids (means) $\mu$: $O(ND)$
  4. Repeat until convergence (guaranteed, to a local optimum)

Calculate the loss after convergence, for evaluation or for the Elbow method:

$L(\mu, X) = \min \sum_{i=1}^{n} \|x_i - \mu_{c(x_i)}\|^2$
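The converged loss is just the sum of squared distances from each point to its assigned centroid; a small sketch (the helper name `kmeans_loss` is mine):

```python
import numpy as np

def kmeans_loss(X, centroids, labels):
    """L(mu, X): sum of squared distances to each point's assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# Toy example: two tight pairs, each 0.5 away from its centroid
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1])
print(kmeans_loss(X, centroids, labels))  # 4 * 0.25 = 1.0
```

For the Elbow method, compute this loss for several values of K and look for the K after which the loss stops dropping sharply.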

Pros

  - Relatively simple to implement
  - Scales to large datasets
  - Guaranteed convergence (to a local optimum)

Cons

  - K must be specified in advance
  - Sensitive to initial centroid values
  - Clusters form a Voronoi diagram of the centers
  - Cannot model covariances well

Limitation

It is sensitive to the initial centroids and to outliers in the data.
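A common mitigation for initialization sensitivity is to run K-means several times from different random starts and keep the run with the lowest loss; a self-contained sketch (the function names and restart count are my choices):

```python
import numpy as np

def one_kmeans_run(X, k, rng, n_iters=50):
    """A single Lloyd's-algorithm run from one random initialization."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        centroids = np.array([X[labels == j].mean(0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    # Recompute assignments so the reported loss matches the final centroids
    labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    loss = float(((X - centroids[labels]) ** 2).sum())
    return loss, centroids, labels

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times and keep the lowest-loss solution."""
    rng = np.random.default_rng(seed)
    return min((one_kmeans_run(X, k, rng) for _ in range(n_restarts)),
               key=lambda run: run[0])
```

Restarts reduce the chance of a bad local optimum but do not address outlier sensitivity, which comes from the mean itself.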
