One instruction, many elements
SIMD stands for single instruction, multiple data. A SIMD instruction operates on wide vector registers, each holding several elements at once in slots called lanes. Instead of adding one pair of numbers, a single add instruction adds, say, eight pairs in parallel within a single core.
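To make the lane picture concrete, here is a minimal sketch in portable C that models what one eight-lane add computes. The name add_lanes and the width of eight floats are assumptions chosen for illustration. The loop is only a scalar model: on SIMD hardware the eight additions would be a single instruction.

```c
#include <stddef.h>

#define LANES 8  /* assumed width: eight 32-bit floats, as in a 256-bit register */

/* Scalar model of one 8-lane SIMD add: lane i computes a[i] + b[i].
   No lane reads another lane's result, which is what lets hardware
   perform all eight additions together. */
void add_lanes(const float a[LANES], const float b[LANES], float out[LANES]) {
    for (size_t i = 0; i < LANES; ++i) {
        out[i] = a[i] + b[i];
    }
}
```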
How vectorization happens
Turning a scalar loop into SIMD code is called vectorization. A compiler can do it automatically when the loop is simple and its iterations are independent, or a programmer can write it explicitly with intrinsics. Either way, the vectorized loop follows the same pattern:
- Operate on a chunk of lanes per iteration.
- Use a remainder loop for the leftover elements that do not fill a full vector.
- Keep iterations independent so lanes do not depend on each other.
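The three points above can be sketched as a chunked loop followed by a remainder loop. This is an illustration in plain C rather than real intrinsics: add_arrays and LANES are hypothetical names, the width of eight is an assumption, and the inner loop stands in for what would be a single vector add.

```c
#include <stddef.h>

#define LANES 8  /* assumed vector width in floats */

/* Computes out[i] = a[i] + b[i] for i in [0, n), laid out the way a
   vectorized loop is structured. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;

    /* Main loop: one full chunk of LANES elements per iteration.
       The inner loop stands in for a single vector add. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }

    /* Remainder loop: the last n % LANES elements, one at a time. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With n = 19 and eight lanes, the main loop runs twice (16 elements) and the remainder loop handles the final 3.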
What blocks vectorization
Several things stop a loop from vectorizing:
- Data dependencies where one iteration needs the previous result.
- Branches inside the loop, though masking can sometimes handle them by computing all lanes and selecting results.
- Misaligned or non-contiguous data that a vector load cannot pack efficiently.
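The first two blockers can be illustrated with a small sketch. All function names here are hypothetical. prefix_sum shows a loop-carried dependency. The other two functions show the same computation written with a branch and with a mask-and-select, the scalar analogue of computing all lanes and then choosing results. The mask trick assumes two's complement integers, where -1 has every bit set.

```c
#include <stddef.h>

/* Loop-carried dependency: out[i] needs the running total from
   iteration i - 1, so iterations cannot simply be spread across lanes. */
void prefix_sum(const int *in, int *out, size_t n) {
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}

/* Branchy form: each iteration takes one of two paths. */
void keep_positive_branch(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (in[i] > 0) {
            out[i] = in[i];
        } else {
            out[i] = 0;
        }
    }
}

/* Masked form: build a per-element mask, then select with bitwise AND.
   Every element runs the same instruction sequence, as SIMD lanes must. */
void keep_positive_masked(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int mask = -(in[i] > 0);  /* -1 (all ones) if positive, else 0 */
        out[i] = in[i] & mask;
    }
}
```

The masked version does the comparison work for every element, including those whose result is discarded. That is the usual cost of trading a branch for a select.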
Put positively, SIMD is most effective on contiguous, aligned data, where a full vector loads in one step.
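As a sketch of what contiguous means in practice, compare two layouts for the same data. The types and function names below are hypothetical, and the example covers contiguity only, not alignment. In the array-of-structures layout the x values are interleaved with y and z. In the separate-array layout they sit next to each other, which is the shape a vector load wants.

```c
#include <stddef.h>

/* Array of structures: consecutive x values are three floats apart,
   so reading several of them at once is a strided access. */
struct Point { float x, y, z; };

void scale_x_aos(struct Point *pts, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) {
        pts[i].x *= k;
    }
}

/* Separate array of x values: neighbours are adjacent in memory,
   so one vector load can pick up several consecutive elements. */
void scale_x_soa(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) {
        x[i] *= k;
    }
}
```

Both functions compute the same result. The difference is only in how far apart the touched elements are in memory.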
Key idea
SIMD vectorization packs several data elements into vector lanes so that one instruction processes them together, but it pays off only when iterations are independent and the data is contiguous and, ideally, aligned.