Operator Scheduling

The graph behind a model

A model is a graph of operators with dependencies: an operator can run only after its inputs are ready. Operator scheduling decides the order in which operators execute and how they overlap, with the aim of keeping the GPU as busy as possible.

Streams and overlap

GPUs support multiple streams: independent queues of work. Work within one stream runs in the order it was issued, while work in different streams can run concurrently. A scheduler can:

Overlap data transfer with computation so copies hide behind math.
Run independent operators on different streams.
Reorder work to reduce idle gaps between dependent operators.
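
The effect of placing independent operators on different streams can be illustrated with a small timeline simulation. This is only a sketch in plain Python, not a GPU API: the `simulate` function, the operator names, and the durations are hypothetical, and it assumes that streams never slow each other down, which real hardware does not guarantee.

```python
def simulate(ops, deps, durations, stream_of):
    """Return (finish times, makespan) for ops issued in the given order.

    An op starts once its stream is idle and all of its inputs have finished.
    `ops` must already be listed in a dependency-respecting order.
    """
    stream_free = {}  # stream id -> time at which the stream becomes idle
    finish = {}       # op name -> finish time
    for op in ops:
        stream = stream_of[op]
        inputs_ready = max((finish[d] for d in deps.get(op, ())), default=0.0)
        start = max(inputs_ready, stream_free.get(stream, 0.0))
        finish[op] = start + durations[op]
        stream_free[stream] = finish[op]
    return finish, max(finish.values())


# Hypothetical graph: "a" and "b" are independent, "c" consumes both.
ops = ["a", "b", "c"]
deps = {"c": ["a", "b"]}
durations = {"a": 3.0, "b": 2.0, "c": 1.0}

_, one_stream = simulate(ops, deps, durations, {"a": 0, "b": 0, "c": 0})
_, two_streams = simulate(ops, deps, durations, {"a": 0, "b": 1, "c": 0})
print(one_stream, two_streams)  # 6.0 4.0
```

With a single stream the three operators run back to back. With two streams, "b" runs alongside "a", and "c" starts as soon as the slower of the two has finished.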

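Dependencies drive the order

Any valid schedule is a topological order of the graph: every operator appears after all the operators that produce its inputs. Within that constraint, the scheduler is free to choose among the valid orders.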
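
A dependency-respecting execution order can be computed with a topological sort. The sketch below uses Kahn's algorithm, chosen here for illustration; the text does not say which algorithm any particular compiler uses. The graph and its operator names are hypothetical.

```python
from collections import deque


def topological_order(deps):
    """Return the operators in an order where every input precedes its consumer.

    `deps` maps an operator name to the list of operators it consumes.
    """
    ops = set(deps)
    for inputs in deps.values():
        ops.update(inputs)

    # Number of distinct inputs each operator is still waiting on.
    pending = {op: len(set(deps.get(op, ()))) for op in ops}
    consumers = {op: [] for op in ops}
    for op, inputs in deps.items():
        for producer in set(inputs):
            consumers[producer].append(op)

    # Operators with no inputs are ready immediately (sorted for determinism).
    ready = deque(sorted(op for op in ops if pending[op] == 0))
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for consumer in sorted(consumers[op]):
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(ops):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical graph with a skip branch that rejoins at "add".
graph = {
    "conv": ["input"],
    "bias": ["conv"],
    "relu": ["bias"],
    "skip": ["input"],
    "add": ["relu", "skip"],
}
order = topological_order(graph)
print(order)
```

Several orders are usually valid for the same graph. Which of the ready operators is picked next is exactly the freedom a scheduler exploits.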

Memory-aware scheduling

Scheduling also affects peak memory. An order that frees intermediate buffers early lets a model fit in less memory, whereas running many operators concurrently keeps more buffers live at once. Compilers therefore balance parallelism against memory, sometimes serializing work to stay within budget.
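
To make the link between order and peak memory concrete, the sketch below computes the peak number of live bytes for a given execution order. It assumes a simplified memory model: each operator allocates one output buffer, and a buffer is freed right after its last consumer runs. The `peak_memory` function, the graph, and the buffer sizes are hypothetical.

```python
def peak_memory(order, deps, out_size):
    """Peak live bytes when running ops in `order`.

    An operator's output is allocated when it runs and freed right after
    its last consumer has run. Outputs with no consumer stay live.
    """
    # Count how many consumers still need each operator's output.
    remaining = {op: 0 for op in order}
    for op in order:
        for d in deps.get(op, ()):
            remaining[d] += 1

    live = 0
    peak = 0
    for op in order:
        live += out_size[op]  # allocate this operator's output
        peak = max(peak, live)
        for d in deps.get(op, ()):
            remaining[d] -= 1
            if remaining[d] == 0:
                live -= out_size[d]  # last consumer done: free the input
    return peak


# Two branches, each with a large intermediate reduced to a small result.
deps = {"r1": ["a1"], "r2": ["a2"], "out": ["r1", "r2"]}
out_size = {"a1": 100, "a2": 100, "r1": 1, "r2": 1, "out": 1}

wide_first = peak_memory(["a1", "a2", "r1", "r2", "out"], deps, out_size)
branch_by_branch = peak_memory(["a1", "r1", "a2", "r2", "out"], deps, out_size)
print(wide_first, branch_by_branch)  # 201 102
```

Both orders respect the dependencies, but finishing one branch before starting the other keeps only one large buffer live at a time. The first order exposes more parallelism, since "a1" and "a2" could overlap, at roughly twice the peak memory.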

Why it matters

Even with fast kernels, a poor schedule leaves the device idle waiting on dependencies or transfers. Good scheduling turns a sequence of operators into a tightly packed pipeline.
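
The cost of idle time can be estimated with a simple model of chunked work, where each chunk must be copied to the device before it is computed on. This is a back-of-the-envelope sketch: it assumes copy and compute can run at the same time without slowing each other down and that enough buffers exist to hold in-flight chunks. The function names and timings are hypothetical.

```python
def serial_time(n_chunks, t_copy, t_compute):
    """Total time when each copy and each compute run one after another."""
    return n_chunks * (t_copy + t_compute)


def overlapped_time(n_chunks, t_copy, t_compute):
    """Total time when the copy of chunk i+1 overlaps the compute of chunk i.

    The first copy and the last compute are exposed; in between, each step
    is bounded by the slower of the two stages.
    """
    return t_copy + (n_chunks - 1) * max(t_copy, t_compute) + t_compute


def compute_utilization(total_time, n_chunks, t_compute):
    """Fraction of the total time during which the device is computing."""
    return n_chunks * t_compute / total_time


n, t_copy, t_compute = 8, 2.0, 3.0
serial = serial_time(n, t_copy, t_compute)          # 40.0
overlapped = overlapped_time(n, t_copy, t_compute)  # 26.0
print(compute_utilization(serial, n, t_compute))
print(compute_utilization(overlapped, n, t_compute))
```

In this example the kernels are identical in both cases. Only the schedule changes, and compute utilization rises from 60% to a little over 90%.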

Key idea

Operator scheduling orders and overlaps a model graph across streams to hide transfers and fill idle gaps while respecting dependencies and memory limits.