The Store Buffer Forwarding Deep Dive

Hiding store latency

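Writing to memory is slow, so a CPU places each store into a per-core store buffer and lets the core continue without waiting for the value to reach the cache. The buffer drains to the cache later, asynchronously. This hides write latency, but it also means a store can be complete from the core's point of view while still being invisible to every other core, which introduces subtle ordering effects.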
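The buffering step can be sketched as a toy model. This is a minimal illustration, not a description of any real microarchitecture: the `Core` class, its `store` and `drain_one` methods, and the dictionary standing in for the cache are all hypothetical names chosen for this example.

```python
from collections import deque


class Core:
    """Toy core (hypothetical model): stores go to a private FIFO buffer."""

    def __init__(self, memory):
        self.memory = memory         # shared dict standing in for the cache
        self.store_buffer = deque()  # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        # The core does not wait: it queues the write and moves on.
        self.store_buffer.append((addr, value))

    def drain_one(self):
        # In real hardware this happens asynchronously; here it is called by hand.
        if self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


memory = {"x": 0}
core = Core(memory)

core.store("x", 1)
before_drain = memory["x"]  # the store is only buffered, memory is unchanged

core.drain_one()
after_drain = memory["x"]   # the store has now reached shared memory

print(before_drain, after_drain)  # prints "0 1"
```

Between `store` and `drain_one` the write exists only inside the core's buffer, which is the window in which the ordering effects discussed in this section arise.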

Store-to-load forwarding

If the same core later loads an address it has just stored to, it must see its own write. The core achieves this with store-to-load forwarding: the load reads the pending value directly from the store buffer rather than from the cache. This keeps a single thread self-consistent even before the store is globally visible.
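A load with forwarding can be sketched as a lookup that checks the buffer before the cache. This is an illustrative model under simplifying assumptions: the `load` function and its arguments are hypothetical, addresses match exactly, and partial or overlapping accesses (which real hardware must handle) are ignored.

```python
from collections import deque


def load(store_buffer, memory, addr):
    """Return the value a local load would observe (hypothetical model)."""
    # Search youngest to oldest so the most recent pending store wins.
    for buffered_addr, value in reversed(store_buffer):
        if buffered_addr == addr:
            return value  # forwarded from the store buffer
    return memory[addr]   # no pending store: read from shared memory


memory = {"x": 0, "y": 7}
# Two pending stores to x, oldest first; neither has reached memory yet.
store_buffer = deque([("x", 1), ("x", 2)])

forwarded = load(store_buffer, memory, "x")    # youngest pending store to x
from_memory = load(store_buffer, memory, "y")  # no pending store to y

print(forwarded, from_memory, memory["x"])  # prints "2 7 0"
```

The local load sees 2 even though shared memory still holds 0 for `x`, which is exactly the gap between local and global visibility.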

The store buffer anomaly

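The famous problem appears when two threads each store to one variable and then load the other. Suppose x and y both start at 0; thread 0 writes x = 1 and then reads y, while thread 1 writes y = 1 and then reads x. Because each store sits in a private buffer that is not yet visible to the other core, both loads can read the old value 0, an outcome forbidden by sequential consistency.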

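Forwarding satisfies the local thread but not remote threads: a buffered store is visible to loads on its own core and invisible to every other core until it drains. This is why store-load reordering, where a later load appears to take effect before an earlier store to a different address, is allowed on common hardware, including x86. A full fence forbids the anomaly by making later loads wait until earlier stores have left the buffer and become globally visible.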
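The anomaly and the effect of a fence can be reproduced deterministically in a toy simulation. This is a sketch under stated assumptions: the `Core` class and `litmus` function are hypothetical, only one interleaving is shown, and the fence is modeled as a complete drain of the buffer, which is one simple way to obtain the required ordering rather than a claim about how any particular CPU implements it.

```python
from collections import deque


class Core:
    """Toy core with a private store buffer (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.buf = deque()  # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        self.buf.append((addr, value))  # buffered, not yet globally visible

    def load(self, addr):
        for a, v in reversed(self.buf):  # forward from own buffer if possible
            if a == addr:
                return v
        return self.memory[addr]

    def fence(self):
        # Assumption: model a full fence as draining every pending store.
        while self.buf:
            a, v = self.buf.popleft()
            self.memory[a] = v


def litmus(use_fence):
    memory = {"x": 0, "y": 0}
    c0, c1 = Core(memory), Core(memory)
    c0.store("x", 1)  # thread 0: x = 1
    c1.store("y", 1)  # thread 1: y = 1
    if use_fence:
        c0.fence()
        c1.fence()
    r0 = c0.load("y")  # thread 0: r0 = y
    r1 = c1.load("x")  # thread 1: r1 = x
    return r0, r1


print(litmus(use_fence=False))  # prints "(0, 0)"
print(litmus(use_fence=True))   # prints "(1, 1)"
```

Without the fence, each load misses the other core's buffered store and both read 0. With the fence between the store and the load, the (0, 0) outcome cannot occur in this model because each store reaches shared memory before the following load runs.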

Key idea

Store buffers hide write latency and forward to local loads, but because buffered stores are not yet visible to other cores, they permit store-load reordering unless a full fence intervenes.

