Synchronization is the cost
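For many concurrent data structures the expensive part is not the work itself but the synchronization around it. If a hundred threads each fight for the same lock or cache line, the contention dwarfs the actual updates. Flat combining turns this around: instead of every thread synchronizing on the structure, one thread applies everyone's work serially, so the synchronization cost is paid once per batch rather than once per operation.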
Publish, then let a combiner act
Each thread owns a slot in a publication list where it posts its requested operation:
- A thread publishes its request and then tries to grab the role of combiner
- Whichever thread wins the combiner role scans the publication list and applies every posted request to the structure on behalf of the threads that posted them
- Other threads simply spin on their own slot until the combiner marks their request done
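The steps above can be sketched in Python as a shared counter. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Slot`, `FlatCombiningCounter`, `combiner_lock`, and `run_workers` are hypothetical, the publication list is a fixed array with one slot per thread, and a non-blocking `threading.Lock` stands in for the combiner role. One further assumption: a waiting thread also retries the combiner role while it spins, so a request posted just after a scan has passed its slot still gets applied. The sketch relies on CPython's interpreter for memory ordering; a lower-level language would need explicit atomics.

```python
import threading
import time


class Slot:
    """One thread's entry in the publication list (hypothetical layout)."""

    def __init__(self):
        self.request = None  # amount to add, or None when nothing is posted
        self.result = None   # filled in by the combiner
        self.done = True     # set by the combiner once the request is applied


class FlatCombiningCounter:
    """A shared counter whose updates are applied by whichever thread is combiner."""

    def __init__(self, n_slots):
        self.value = 0  # the shared structure; only the combiner touches it
        self.slots = [Slot() for _ in range(n_slots)]
        self.combiner_lock = threading.Lock()  # holding it means "I am the combiner"

    def add(self, slot_id, delta):
        slot = self.slots[slot_id]
        slot.done = False
        slot.request = delta  # publish the request
        while not slot.done:
            # Try to become the combiner without blocking.
            if self.combiner_lock.acquire(blocking=False):
                try:
                    self._combine()
                finally:
                    self.combiner_lock.release()
            else:
                time.sleep(0)  # yield, then check our own slot again
        return slot.result

    def _combine(self):
        # Single-threaded pass: apply every posted request, then mark it done.
        for slot in self.slots:
            delta = slot.request
            if delta is not None:
                slot.request = None
                self.value += delta
                slot.result = self.value
                slot.done = True


def run_workers(n_threads, ops_per_thread):
    """Hypothetical driver: each thread adds 1 to the counter ops_per_thread times."""
    counter = FlatCombiningCounter(n_threads)

    def worker(slot_id):
        for _ in range(ops_per_thread):
            counter.add(slot_id, 1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Note that `self.value` is never protected by its own lock: only the thread holding the combiner role reads or writes it, which is the point of the scheme.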
Why batching helps
The combiner touches the structure single-threaded, so there is no internal locking or cache-line ping-pong during the batch. It also gains a chance to optimize across requests, for instance cancelling a push against a pop in the same batch so that neither touches the stack, or applying many updates while the data is hot in its cache. The cost of synchronization is amortized over the whole batch.
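As one illustration of cross-request optimization, the sketch below shows how a combiner might cancel pushes against pops within a single batch before touching the stack. It is an assumption-laden example rather than the canonical algorithm: the function name `combine_stack_batch`, the `("push", value)` / `("pop", None)` request format, and returning `None` for a pop on an empty stack are all choices made here for illustration. Pairing is legitimate because the requests in a batch are concurrent, so the combiner is free to order a push immediately before the pop that receives its value.

```python
def combine_stack_batch(stack, batch):
    """Apply a batch of stack requests, eliminating push/pop pairs first.

    stack: a Python list used as the shared stack (top at the end).
    batch: list of ("push", value) or ("pop", None) requests.
    Returns one result per request, in batch order (None for pushes).
    """
    results = [None] * len(batch)
    pushes = [i for i, (op, _) in enumerate(batch) if op == "push"]
    pops = [i for i, (op, _) in enumerate(batch) if op == "pop"]

    # Eliminate pairs: the pop receives the pushed value directly,
    # and the shared stack is never touched for either request.
    while pushes and pops:
        push_i = pushes.pop()
        pop_i = pops.pop()
        results[pop_i] = batch[push_i][1]

    # At most one of the two lists is still non-empty; the leftovers
    # are the only requests that reach the real structure.
    for i in pushes:
        stack.append(batch[i][1])
    for i in pops:
        results[i] = stack.pop() if stack else None  # None on empty: an assumption

    return results
```

With a stack holding `[1]` and a batch of one push and two pops, one pop is served by the push and only the other pop reaches the stack.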
Key idea
Flat combining has one thread batch everyone's published operations and apply them single-threaded, amortizing synchronization cost and enabling cross-request optimization.