K Means Clustering Revisited
K means is the workhorse of unsupervised partitioning. It splits a dataset into k clusters so that points in the same group are close to a shared center, called a centroid.
The two step loop
K means alternates between two steps until the assignments stop changing.
- Assignment step: attach every point to its nearest centroid using Euclidean distance.
- Update step: move each centroid to the mean of the points assigned to it.
Repeating these steps minimizes the within cluster sum of squares, the total squared distance from points to their centroids.
Practical notes
- You must choose k in advance, which is a real limitation.
- The result depends on the random starting centroids, so methods like k means plus plus pick smarter seeds.
- K means assumes clusters are roughly round and similarly sized, so it struggles with stretched or nested shapes.
Why it converges
Each step can only lower or hold the objective, and there are finitely many assignments, so the loop reaches a stable point. That point may be a local minimum, which is why several restarts are common.
Key idea
K means iterates assignment and update steps to minimize within cluster spread, but needs a chosen k and good seeds.