- Input modification-based defenses
- Output filtering-based defenses
- Prompt engineering defenses
- Execution time refusal
Jailbreaking Defense Methods
Adversarial Training is a method where a model is trained with intentionally crafted adversarial examples to enhance its robustness against attacks.
Adversarial Training
Refusal in LLMs is mediated by a single direction
That means we can bypass LLMs by mediating a single activation feature or prevent bypassing LLMs though anchoring that activation.
Circuit Breaking & other methods
Constitutional Classifiers from Anthropic AI Constitutional AI
Heuristic rules
Most defenses are vulnerable to adaptive attacks and are only evaluated on static attacks
- Prompting-based defenses can be bypassed by restructed prompts or conditional/multi-step request
- Training-based defenses are trained on fixed attack data and fail to generalize.
- Filtering-based defenses can be exploited by leveraging the detection model's confidence scores to generate malicious prompts that appear benign.
- Tool-call type defenses → Rapid Response, AI Circuit Breaker circuit breaks can be bypassed

Seonglae Cho
