When you must render HTML
Sometimes you genuinely need to render user-supplied HTML, such as comments with formatting. Encoding the input as plain text would destroy that formatting, so you sanitize instead: parse the markup and remove anything that could execute. The goal is to keep a small set of safe tags while dropping scripts and dangerous attributes.
- Parse the input into a document tree first.
- Keep an allow list of safe tags and attributes.
- Remove scripts, event handlers, and risky URLs.
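The three steps above can be sketched with nothing but the Python standard library. This is an illustration of the idea under stated assumptions, not a production sanitizer: the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `ALLOWED_SCHEMES`, `AllowListSanitizer`, `is_safe_url`, and `sanitize` are hypothetical, the allow list is an arbitrary example, and the sketch does not rebalance unclosed tags or handle the browser parsing quirks a maintained library covers.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow list for illustration; a real policy is chosen per application.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}                 # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}  # tags whose text content is dropped too


def is_safe_url(url):
    """Accept relative URLs and URLs whose scheme is on the allow list."""
    # Ignore whitespace and control characters, which can hide a scheme.
    cleaned = "".join(ch for ch in url if ord(ch) > 0x20)
    scheme, sep, _ = cleaned.partition(":")
    if not sep or any(c in scheme for c in "/?#"):
        return True  # no scheme present, so the URL is relative
    return scheme.lower() in ALLOWED_SCHEMES


class AllowListSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped; their text is kept as text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops on* handlers, style, and anything unlisted
            value = value or ""
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))  # re-encode all text


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Because the output is rebuilt from parsed tokens rather than edited in place, anything the allow list does not name (comments, unknown tags, unlisted attributes) never reaches the result.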
Use a vetted library
Writing a sanitizer with regular expressions is a known trap because attackers hide payloads in malformed markup and unusual encodings. A maintained sanitizer understands the parsing quirks browsers apply and closes the gaps you would miss.
- Prefer an allow list over a block list approach.
- Strip "on" event handler attributes (such as onclick and onerror) and javascript: scheme URLs.
- Run the sanitizer last, on the exact string the browser will render, so no later transformation can reintroduce markup.
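To see why the regular expression approach is a trap, consider a deliberately naive block list that deletes script elements. The function name `naive_strip_scripts` and both sample inputs are made up for this illustration.

```python
import re

# A naive block list: delete <script>...</script> pairs with one regular expression.
SCRIPT_RE = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip_scripts(html_text):
    return SCRIPT_RE.sub("", html_text)


# Removing the inner pairs splices the outer fragments into a working script tag.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"

# No script tag at all, so the block list has nothing to match.
handler = '<img src="x" onerror="alert(1)">'

print(naive_strip_scripts(nested))
print(naive_strip_scripts(handler))
```

The first input comes out as a complete script element that the filter itself assembled, and the second passes through untouched. An allow list avoids both failures because it never has to anticipate what the attacker will write.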
Sanitizing pairs well with a Content Security Policy (CSP), so that even if a dangerous tag slips through, it still cannot load or run attacker script.
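As a rough sketch of what such a policy might look like, the snippet below assembles a Content-Security-Policy header value. The helper `build_csp` and the particular directives in `POLICY` are assumptions chosen for illustration; a real policy depends on which origins the application actually loads resources from.

```python
def build_csp(directives):
    """Join directive names and their sources into a CSP header value."""
    return "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )


# Example policy (an assumption, not a recommendation for every site).
POLICY = {
    "default-src": ["'self'"],
    "script-src": ["'self'"],   # no 'unsafe-inline', so inline scripts are refused
    "object-src": ["'none'"],
    "base-uri": ["'none'"],
}

HEADER = ("Content-Security-Policy", build_csp(POLICY))
```

With a policy like this, script-src without 'unsafe-inline' tells the browser to refuse inline script elements and inline event handlers, which gives a second line of defense behind the sanitizer.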
Key idea
To render user HTML safely, parse and sanitize it with a vetted allow list library rather than block listing tags with regular expressions.