Why the metric matters
Any nearness based method, from neighbors to clustering, needs a rule for distance. The choice of metric reshapes which points look close, so it directly changes predictions.
Common metrics
- Euclidean distance is straight line distance, the everyday default.
- Manhattan distance sums absolute differences, like city blocks, and resists extreme single dimensions.
- Cosine similarity compares the angle between vectors, ignoring magnitude, useful for text.
- Hamming distance counts mismatching positions for categorical or binary data.
Scaling and the curse
- Without scaling, features in large units dominate Euclidean distance.
- In very high dimensions, distances between points grow similar, the curse of dimensionality, weakening nearness.
Picking one
- Use cosine when direction matters more than length, as with documents.
- Use Manhattan when you want robustness to a few large coordinate gaps.
- Match the metric to the data type and the meaning of similarity.
Key idea
Distance metrics define what nearness means. Euclidean, Manhattan, cosine, and Hamming each suit different data, and scaling plus dimensionality strongly affect them.