SIMD Vectorization in Depth

Processing multiple data elements per instruction using wide vector registers and lanes.

One instruction, many elements

SIMD stands for single instruction, multiple data. A SIMD instruction operates on a wide vector register that is divided into lanes, each holding one element. Instead of adding one pair of numbers, a single add instruction adds, say, eight pairs in parallel within a single core.
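To make the lane picture concrete, here is a minimal sketch in portable C that models a vector register as eight float lanes. The names `vec8f`, `LANES`, and `vec8f_add` are hypothetical, chosen only for illustration, and the lane count of eight is an assumption matching the example above. On real hardware the loop body would be a single instruction rather than eight separate additions.

```c
#include <stddef.h>

/* Hypothetical model of a vector register with 8 lanes of 32-bit floats. */
enum { LANES = 8 };

typedef struct {
    float lane[LANES];
} vec8f;

/* Models one SIMD add: lane i of a is added to lane i of b.
 * No lane reads the result of any other lane. */
static vec8f vec8f_add(vec8f a, vec8f b)
{
    vec8f r;
    for (size_t i = 0; i < LANES; ++i) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}
```

Because each lane depends only on its own two inputs, all eight additions can happen at the same time.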

How vectorization happens

Vectorization is the process of turning a scalar loop into SIMD code. A compiler can do it automatically when the loop is simple and its iterations are independent, or a programmer can write it explicitly with intrinsics.

Process one full vector of elements (one element per lane) per iteration.
Use a remainder loop for the leftover elements that do not fill a full vector.
Keep iterations independent so lanes do not depend on each other.
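The three points above can be sketched as a loop in portable C. This is an illustration of the loop shape, not a definitive implementation: `add_arrays` and `VEC_WIDTH` are hypothetical names, and the width of eight is an assumption. The inner loop stands in for one vector instruction, and whether a compiler actually emits SIMD code for it depends on the compiler, target, and optimization flags.

```c
#include <stddef.h>

/* Assumed lane count, for example 8 floats in a 256-bit register. */
enum { VEC_WIDTH = 8 };

/* Computes out[i] = a[i] + b[i] for i in [0, n).
 * restrict promises the arrays do not overlap, so iterations are independent. */
static void add_arrays(const float *restrict a, const float *restrict b,
                       float *restrict out, size_t n)
{
    size_t i = 0;

    /* Main loop: one full chunk of VEC_WIDTH elements per iteration. */
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH) {
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }

    /* Remainder loop: the n % VEC_WIDTH leftover elements, one at a time. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With `n = 19`, for instance, the main loop handles elements 0 through 15 in two chunks and the remainder loop handles the last three.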

What blocks vectorization

Several things stop a loop from vectorizing:

Data dependencies where one iteration needs the previous result.
Branches inside the loop, though masking can sometimes handle them by computing all lanes and selecting results.
Misaligned or non-contiguous data that a vector load cannot pack efficiently.

Effective SIMD also needs contiguous, and ideally aligned, data so a full vector loads in one step. Most modern hardware supports unaligned vector loads, but they can be slower.
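The masking idea mentioned for branches can be sketched in portable C as follows. The function `scale_negatives` and its behavior are hypothetical, made up for illustration: it replaces a branch ("if the element is negative, negate and scale it, otherwise keep it") with computing both outcomes for every element and then selecting one. Compilers often turn this pattern into a vector compare plus a blend, though that is not guaranteed.

```c
#include <stddef.h>

/* Branch-free form of:
 *     if (x[i] < 0) out[i] = -x[i] * scale; else out[i] = x[i];
 * Both candidate results are computed for every element, then a
 * per-element mask selects which one is stored. */
static void scale_negatives(const float *restrict x, float *restrict out,
                            size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i) {
        float if_negative = -x[i] * scale;  /* result when the condition holds */
        float otherwise   = x[i];           /* result when it does not */
        int   mask        = x[i] < 0.0f;    /* per-element condition, 0 or 1 */
        out[i] = mask ? if_negative : otherwise;  /* select one result */
    }
}
```

The cost is that both sides are evaluated for every element, so masking tends to pay off when each side is cheap and free of side effects.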

Key idea

SIMD vectorization packs several data elements into vector lanes so one instruction processes them together, but it works well only when iterations are independent and data is contiguous and, ideally, aligned.
