Reversing the map
Detokenization maps generated token ids back to a string. With reversible schemes such as SentencePiece the mapping is designed to be lossless, but several traps appear in practice.
Whitespace and joining
Tokenizers encode spaces in different ways, often gluing a leading space onto a token. Naively concatenating token strings can drop or duplicate spaces, so detokenization must follow the same convention the tokenizer used.
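As a minimal sketch of the joining problem, assume a SentencePiece-like convention in which the marker "▁" (U+2581) at the start of a piece stands for a preceding space. The piece list, the `join_pieces` helper, and the choice to drop one leading space are illustrative assumptions, not any particular library's API.

```python
# Assumed convention: "\u2581" at the start of a piece means
# "a space preceded this piece in the original text".
SPACE_MARK = "\u2581"

def join_pieces(pieces):
    """Join token pieces following the assumed space-marker convention."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    # Assumption: the tokenizer prepended one dummy space to the input,
    # so a single leading space is dropped on the way back.
    return text[1:] if text.startswith(" ") else text

# Hypothetical pieces for the text "Hello world!"
pieces = ["\u2581Hello", "\u2581wor", "ld", "!"]

naive = " ".join(pieces)          # inserts spaces between every piece
restored = join_pieces(pieces)    # follows the tokenizer's convention
```

Here the naive join both leaves the marker in the text and splits "world" with a stray space, while the convention-aware join recovers the original string.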
Split characters and streaming
A multi-byte character may straddle two tokens. If you decode and display token by token while streaming, you can momentarily render a broken character until its partner token arrives.
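The failure can be reproduced with plain UTF-8 bytes. The sketch below assumes a byte-level tokenizer whose tokens happen to split the two bytes of "é"; the token byte strings are made up for illustration.

```python
# Hypothetical byte-level tokens for "café": the two UTF-8 bytes
# of "é" (0xC3 0xA9) land in separate tokens.
token_bytes = [b"caf", b"\xc3", b"\xa9"]

# Decoding each token in isolation: each half of "é" is invalid on its
# own, so it becomes U+FFFD (the replacement character).
per_token = [t.decode("utf-8", errors="replace") for t in token_bytes]

# Decoding the concatenated bytes yields the intended text.
whole = b"".join(token_bytes).decode("utf-8")
```

Joining the per-token results would display two replacement characters where a single "é" belongs.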
Safe streaming
The fix is to buffer raw bytes and only emit text once a full character is complete, rather than decoding each token in isolation. Special tokens must also be stripped or rendered deliberately so control markers do not leak into user-facing output.
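One way to sketch this in Python is with the standard library's incremental UTF-8 decoder, which holds incomplete byte sequences until the remaining bytes arrive. The vocabulary, the `SPECIAL_IDS` set, and the `stream_text` function are hypothetical names for illustration; a real tokenizer would supply its own id-to-bytes mapping and special-token list.

```python
import codecs

# Hypothetical ids of control tokens that should not reach the user.
SPECIAL_IDS = {0, 1}

def stream_text(token_ids, id_to_bytes, special_ids=SPECIAL_IDS):
    """Yield displayable text chunks, buffering incomplete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for tid in token_ids:
        if tid in special_ids:
            continue  # strip control markers deliberately
        chunk = decoder.decode(id_to_bytes[tid])
        if chunk:  # empty while a character is still incomplete
            yield chunk
    # Flush at end of stream; leftover partial bytes become U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# Hypothetical vocabulary: ids 3 and 4 together encode "é".
vocab = {0: b"<s>", 1: b"</s>", 2: b"caf", 3: b"\xc3", 4: b"\xa9"}
chunks = list(stream_text([0, 2, 3, 4, 1], vocab))
```

With this input the stream emits "caf" and then "é" as a whole character, and neither special token appears in the output. Skipping special tokens is only one policy; an application could instead map them to explicit display text.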
Key idea
Detokenization must respect whitespace conventions and buffer split multi-byte characters, especially when streaming, to avoid broken or leaked output.