The Streaming Token Interface
Language models generate one token at a time. Streaming delivers each token to the application as it is produced rather than buffering until the full response is done.
Why stream
- Time to first token is short, so the user sees output almost immediately.
- Perceived latency drops even when total generation time is unchanged.
- Partial output can be processed or canceled early if it goes off track.
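The points above can be illustrated with a minimal sketch. It assumes a stand-in generator, fake_token_stream, in place of a real model API, and the names consume_stream, on_token, and should_cancel are hypothetical. Each token is handed to the display callback as soon as it arrives, and the consumer stops reading once a cancel condition holds.

```python
from typing import Callable, Iterable, Iterator, List


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model API that yields tokens one at a time."""
    for token in ["The", " answer", " is", " 42", ".", " Also", ",", " unrelated", " text"]:
        yield token


def consume_stream(
    tokens: Iterable[str],
    on_token: Callable[[str], None],
    should_cancel: Callable[[str], bool],
) -> str:
    """Show each token as it arrives and stop early when should_cancel says so."""
    received: List[str] = []
    for token in tokens:
        received.append(token)
        on_token(token)  # the user sees this token immediately
        if should_cancel("".join(received)):
            break  # stop reading; the remaining tokens are never consumed
    return "".join(received)


displayed: List[str] = []
text = consume_stream(
    fake_token_stream(),
    on_token=displayed.append,
    should_cancel=lambda so_far: so_far.endswith("."),  # illustrative stop rule
)
```

Here the cancel rule is just "stop after the first full stop"; a real application might instead react to a user pressing a stop button or to a check on the partial text.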
How it changes agent design
In an agent, streaming complicates tool calls. The runtime must watch the stream and detect when the model is requesting a tool, sometimes before the full arguments have arrived. Some systems stream visible text to the user while accumulating tool-call fragments separately, then execute the tool once the call is complete.
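A minimal sketch of that pattern follows. The event shapes ("text", "tool_call_delta", "tool_call_end") and the function run_stream are assumptions made for illustration and do not correspond to any particular vendor's API. Visible text is forwarded immediately, while argument fragments are accumulated per call id and parsed only when the call is marked complete.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def run_stream(
    events: Iterable[Dict[str, Any]],
    show_text: Callable[[str], None],
    tools: Dict[str, Callable[..., Any]],
) -> List[Any]:
    """Forward text at once; buffer tool-call fragments until the call ends."""
    pending: Dict[str, Dict[str, str]] = {}  # call id -> partial name and arguments
    results: List[Any] = []
    for event in events:
        kind = event["type"]
        if kind == "text":
            show_text(event["text"])
        elif kind == "tool_call_delta":
            call = pending.setdefault(event["id"], {"name": "", "args": ""})
            call["name"] = event.get("name") or call["name"]
            call["args"] += event.get("arguments", "")  # still incomplete JSON
        elif kind == "tool_call_end":
            call = pending.pop(event["id"])
            arguments = json.loads(call["args"])  # safe to parse only now
            results.append(tools[call["name"]](**arguments))
    return results


# Hypothetical event sequence: text and argument fragments arrive interleaved.
events = [
    {"type": "text", "text": "Checking the weather"},
    {"type": "tool_call_delta", "id": "c1", "name": "get_weather", "arguments": '{"ci'},
    {"type": "text", "text": "..."},
    {"type": "tool_call_delta", "id": "c1", "arguments": 'ty": "Oslo"}'},
    {"type": "tool_call_end", "id": "c1"},
]

shown: List[str] = []
results = run_stream(
    events,
    show_text=shown.append,
    tools={"get_weather": lambda city: f"sunny in {city}"},
)
```

Keeping the pending calls in a dictionary keyed by call id lets several tool calls be in flight in the same stream without their fragments mixing.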
Handling partial state
Streaming means working with incomplete data. A robust consumer buffers fragments, parses only when a unit is complete, and handles interruptions gracefully, such as a connection dropped mid-response. The payoff is responsiveness: a streamed agent feels alive and lets users intervene early, which matters most in long-running, multi-step tasks.
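As a rough sketch of such a consumer, assume the stream carries newline-delimited JSON records and that a dropped connection surfaces as a ConnectionError. Both are assumptions for this example, and flaky_chunks and read_records are hypothetical names. The reader parses a record only once its terminating newline has arrived, and on interruption it returns what it has together with the unparsed remainder.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def flaky_chunks() -> Iterator[str]:
    """Hypothetical stream that splits records mid-way, then drops."""
    yield '{"step": 1}\n{"st'
    yield 'ep": 2}\n{"step"'
    raise ConnectionError("connection dropped mid-response")


def read_records(chunks: Iterable[str]) -> Tuple[List[Dict[str, Any]], str, bool]:
    """Return (complete records, unparsed leftover, whether the stream finished)."""
    buffer = ""
    records: List[Dict[str, Any]] = []
    completed = True
    try:
        for chunk in chunks:
            buffer += chunk
            # Parse only units terminated by a newline; keep the tail buffered.
            while "\n" in buffer:
                line, buffer = buffer.split("\n", 1)
                if line.strip():
                    records.append(json.loads(line))
    except ConnectionError:
        completed = False  # keep what was parsed instead of discarding it
    return records, buffer, completed


records, leftover, completed = read_records(flaky_chunks())
```

Returning the leftover fragment and a completion flag leaves the decision to the caller, which might retry, resume, or report the partial result to the user.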
Key idea
Streaming sends tokens as they are generated, cutting perceived latency but requiring the runtime to handle partial output carefully.