Memory is not uniform
On large servers, memory is split among sockets. In a non-uniform memory access (NUMA) design, each socket has its own local memory, and reaching another socket's memory means crossing an interconnect.
The latency gap
- Accessing local memory is fast.
- Accessing remote memory on another socket is slower, often noticeably so.
- Bandwidth to remote memory is also limited by the interconnect.
So the same instruction can be cheap or expensive depending on where the data physically lives.
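The effect of the gap can be sketched with a simple weighted average. This is a minimal model, not a measurement: the function name average_latency_ns and the two latency figures are assumptions chosen only for illustration, and real values vary by machine.

```python
def average_latency_ns(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Expected latency per access when a given fraction of accesses go to remote memory."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Assumed, illustrative latencies; measure your own hardware for real numbers.
LOCAL_NS = 100.0
REMOTE_NS = 150.0

if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        latency = average_latency_ns(LOCAL_NS, REMOTE_NS, fraction)
        print(f"remote fraction {fraction:.1f}: {latency:.1f} ns per access")
```

Under these assumed figures, the same load instruction costs anywhere between the local and the remote latency, depending only on how much of the data sits on another socket.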
Designing for NUMA
Performance hinges on placement. A common default policy is first touch: a page is allocated on the socket of the thread that first touches it, typically by writing to it. To exploit this, pin threads to cores so the scheduler cannot move them away from their data, and have each thread initialize the data it will later use, so that its accesses stay local.
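Ignoring NUMA can leave all memory on one socket while threads spread across the others, for example when a single thread initializes every buffer. That socket's memory and the interconnect links leading to it then saturate, while memory bandwidth on the other sockets sits idle. Awareness turns a hidden penalty into predictable local access.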
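The difference between the two initialization strategies can be illustrated with a toy model of first-touch placement. It is a sketch under stated assumptions, not how an operating system implements the policy: the machine size, the page count, and the helper names (worker_for_page, place_pages, local_fraction) are all hypothetical, and worker w is assumed to be pinned to socket w.

```python
from collections import Counter

NUM_SOCKETS = 2                       # assumed machine size
NUM_PAGES = 8                         # assumed data size, in pages
PAGES_PER_WORKER = NUM_PAGES // NUM_SOCKETS


def worker_for_page(page: int) -> int:
    """Worker (and, by the pinning assumption, socket) that later uses this page."""
    return page // PAGES_PER_WORKER


def place_pages(init_writes):
    """Apply first touch: a page belongs to the socket of its first writer."""
    placement = {}
    for page, socket in init_writes:
        placement.setdefault(page, socket)  # later writes do not move the page
    return placement


def local_fraction(placement) -> float:
    """Fraction of pages that live on the socket of the worker that uses them."""
    local = sum(1 for page in range(NUM_PAGES)
                if placement[page] == worker_for_page(page))
    return local / NUM_PAGES


# One thread on socket 0 initializes everything (NUMA ignored).
serial_init = [(page, 0) for page in range(NUM_PAGES)]
# Each pinned worker initializes only the pages it will use.
parallel_init = [(page, worker_for_page(page)) for page in range(NUM_PAGES)]

if __name__ == "__main__":
    for name, writes in (("serial init", serial_init), ("parallel init", parallel_init)):
        placement = place_pages(writes)
        per_socket = dict(Counter(placement.values()))
        print(f"{name}: pages per socket {per_socket}, "
              f"local fraction {local_fraction(placement):.2f}")
```

In this model, serial initialization puts every page on socket 0, so only half of the later accesses are local, while per-worker initialization spreads the pages evenly and makes every access local.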
Key idea
In a NUMA system, memory latency depends on which socket owns the data, so placing data near the thread that uses it, through first-touch allocation and thread pinning, is key to performance.