In generative AI, jailbreaks, also known as direct prompt injection attacks, are malicious user inputs that attempt to circumvent an AI model’s intended behavior.
A successful jailbreak has the potential to subvert all or most of the responsible AI (RAI) guardrails that the AI vendor built into the model through training, making risk mitigations at other layers of the AI stack a critical design choice as part of defense in depth.
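To make the layered approach concrete, the following is a minimal sketch of defense in depth around a model call. All names here are illustrative assumptions rather than any vendor's API: `call_model()` stands in for a real model SDK call, and the simple pattern screen and output check stand in for dedicated prompt-injection and content-safety classifiers.

```python
import re

# Crude stand-in for a dedicated jailbreak/prompt-injection classifier (illustrative patterns only).
SUSPECT_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"pretend (you|to) (are|be)",
    r"developer mode",
]


def looks_like_jailbreak(user_input: str) -> bool:
    """Layer 1: screen user input before it reaches the model."""
    lowered = user_input.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPECT_PATTERNS)


def violates_output_policy(model_output: str) -> bool:
    """Layer 3: check the model's response against a content policy.

    A real system would call a content-safety classifier here; this is a placeholder.
    """
    return "BLOCKED_CONTENT_MARKER" in model_output


def call_model(prompt: str) -> str:
    """Hypothetical model call; replace with your provider's SDK."""
    return f"(model response to: {prompt})"


def answer(user_input: str) -> str:
    # Layer 1: input screening, independent of the model's own RAI training.
    if looks_like_jailbreak(user_input):
        return "Request refused by input guardrail."
    # Layer 2: the model's built-in RAI training is the vendor-provided defense.
    model_output = call_model(user_input)
    # Layer 3: output screening catches responses that slip past layers 1 and 2.
    if violates_output_policy(model_output):
        return "Response withheld by output guardrail."
    return model_output
```

The point of the sketch is structural: even if a jailbreak defeats the model's own training (layer 2), the surrounding input and output checks give the system additional, independent opportunities to block or contain the attack.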