Input filtering catches the obvious stuff. But it does nothing against indirect injection — a malicious instruction buried in a PDF your RAG pipeline just ingested. That payload never touches your input filter.
The real control surface isn't the prompt. It's what your model is allowed to do once it's already compromised.
Four things I'd argue matter more than input sanitization:
→ Tool allowlists instead of open function calling
→ Scoped credentials per agent action (not one super-key)
→ Human-in-the-loop on anything that mutates state
→ Output validation before results touch a downstream system
The mental model shift: assume the prompt is already compromised. Then design the blast radius.
Curious where everyone here lands on this — are you filtering inputs, constraining actions, or both? Drop your setup in the comments 👇