Machine Learning

Detokenization Issues

Why turning tokens back into text is trickier than it looks, especially when streaming.

Reversing the map

Detokenization turns generated token IDs back into a string. With schemes designed to be reversible, such as SentencePiece, this is a well-defined inverse of encoding, but several traps still appear in practice.

Whitespace and joining

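Tokenizers encode spaces in different ways, often gluing a leading space onto the token that follows it (SentencePiece marks it with "▁", GPT-2 style byte-level BPE with "Ġ"). Naively concatenating token strings, or joining them with spaces, can drop or duplicate spaces, so detokenization must follow the same convention the tokenizer used.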
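To make the whitespace trap concrete, here is a minimal sketch in plain Python. The piece list is made up for illustration, and the functions join_with_spaces and detokenize are hypothetical helpers, not a library API. It assumes a SentencePiece-style convention in which "▁" (U+2581) stands for a preceding space and a dummy space is added at the start of the text.

```python
# Hypothetical SentencePiece-style pieces; "\u2581" marks a preceding space.
pieces = ["\u2581Hello", ",", "\u2581wor", "ld", "!"]

def join_with_spaces(pieces):
    """Naive: treat every piece as a word and put a space between pieces."""
    return " ".join(p.replace("\u2581", "") for p in pieces)

def detokenize(pieces):
    """Follow the convention: concatenate, then map the marker back to a space."""
    text = "".join(pieces).replace("\u2581", " ")
    # Assumption: the tokenizer added a dummy leading space, so drop it.
    return text[1:] if text.startswith(" ") else text

naive = join_with_spaces(pieces)    # spaces appear inside words and before punctuation
correct = detokenize(pieces)        # spacing matches the original text

print(repr(naive))
print(repr(correct))
```

In real code the tokenizer's own decode routine should do this work; the point of the sketch is only that the joining rule has to match the encoding rule.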

Split characters and streaming

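A multi-byte UTF-8 character may straddle two tokens, which happens with byte-level vocabularies and byte fallback. If you decode and display token by token while streaming, you can momentarily render a broken character (typically the replacement character, U+FFFD) until its partner token arrives.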
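The failure is easy to reproduce with plain bytes. The sketch below assumes a hypothetical byte-level tokenizer that happened to split "é" (bytes 0xC3 0xA9 in UTF-8) across two tokens; the token boundaries are invented for illustration.

```python
# Hypothetical token payloads: the two bytes of "é" land in different tokens.
token_bytes = [b"caf", b"\xc3", b"\xa9"]

def decode_each(chunks):
    """Decode every token in isolation, as a naive streaming loop would."""
    return [chunk.decode("utf-8", errors="replace") for chunk in chunks]

shown = decode_each(token_bytes)                 # what the user sees, piece by piece
whole = b"".join(token_bytes).decode("utf-8")    # what the text actually is

print(shown)
print(whole)
```

Each half of the character is invalid UTF-8 on its own, so it decodes to U+FFFD, even though the joined bytes decode cleanly.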

Safe streaming

The fix is to buffer raw bytes and emit text only once a full character is complete, rather than decoding each token in isolation. Special tokens must also be stripped or rendered deliberately so that control markers do not leak into user-facing output.
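One way to sketch this buffering in Python is with the standard library's incremental UTF-8 decoder, which holds back incomplete byte sequences until the rest arrives. The function stream_text, the SPECIAL_TOKENS set, and the "<|endoftext|>" marker are assumptions for illustration; a real tokenizer would normally identify special tokens by ID rather than by byte content.

```python
import codecs

# Hypothetical control markers that should never reach the user.
SPECIAL_TOKENS = {b"<|endoftext|>"}

def stream_text(token_bytes_iter, special=SPECIAL_TOKENS):
    """Yield displayable text, holding back bytes of incomplete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in token_bytes_iter:
        if chunk in special:
            continue  # strip control markers deliberately
        text = decoder.decode(chunk)  # returns "" while a character is incomplete
        if text:
            yield text
    # Flush at end of stream; dangling bytes become U+FFFD instead of vanishing.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "é" and the four-byte emoji U+1F642 are each split across two tokens.
chunks = [b"caf", b"\xc3", b"\xa9", b" \xf0\x9f", b"\x99\x82", b"<|endoftext|>"]
out = list(stream_text(chunks))
print(out)
```

The trade-off is a small delay: nothing is shown for a token that ends mid-character, and the completed character appears when the next token arrives.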

Key idea

Detokenization must respect whitespace conventions and buffer split multi-byte characters, especially when streaming, to avoid broken or leaked output.

Check yourself

1. Why can naive token-by-token decoding break during streaming?

2. What is the recommended fix for safe streaming detokenization?

3. Why must special tokens be handled during detokenization?