The goal
K-means is an unsupervised method that partitions data into k groups so that points within a cluster are close to that cluster's center. You choose k in advance, and the algorithm finds the cluster centers, called centroids.
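One common way to make "close together" precise is the within-cluster sum of squared distances. The notation below is an assumption introduced for illustration and does not appear in the notes: C_j is the set of points in cluster j, mu_j is its centroid, and J is the quantity k-means tries to make small.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```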
The loop
K-means alternates two steps until it converges:
- Assignment, where each point joins the nearest centroid
- Update, where each centroid moves to the mean of its assigned points
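Neither step can increase the total within-cluster squared distance, so repeating them lowers it until assignments stop changing. The result is a local optimum, not necessarily the best possible clustering.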
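The two steps can be sketched in plain Python. This is a minimal illustration and not a production implementation: the function name kmeans, the max_iter and seed parameters, the naive random seeding, and the sample points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Naive seeding (an assumption): pick k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has settled
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated blobs of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
```

On data this cleanly separated, the first three points should end up sharing one label and the last three the other. Which blob gets label 0 depends on the seeding.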
Caveats
- Results depend on the initial centroids, so careful seeding such as k-means++ helps
- It assumes roughly round, similarly sized clusters and struggles with elongated or non-convex shapes
- k is often chosen with the elbow method or silhouette scores
- Features should be scaled to comparable ranges, since the method relies on distance and large-valued features would otherwise dominate
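The scaling caveat is easy to see with a small example. The sketch below standardizes each feature to zero mean and unit variance, which is one common choice among several. The standardize helper and the income and age data are hypothetical and chosen only to make the effect visible.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: [income in dollars, age in years]
people = [[30000, 20], [30400, 60], [31000, 22], [90000, 40]]

# Nearest neighbour of person 0 among persons 1 and 2, before and after scaling.
raw_nearest = min((1, 2), key=lambda j: math.dist(people[0], people[j]))
scaled = standardize(people)
scaled_nearest = min((1, 2), key=lambda j: math.dist(scaled[0], scaled[j]))
```

On the raw data, income dominates the distance, so person 0 looks closest to person 1 despite a 40-year age gap. After standardization, person 2, who is similar in both income and age, becomes the nearest.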
Key idea
K-means iterates assignment and update steps to place k centroids that locally minimize within-cluster squared distance. The result is sensitive to initialization and cluster shape.