- Input modification-based defenses
- Output filtering-based defenses
- Prompt engineering defenses
- Execution time refusal
Jailbreaking Defense Methods
Adversarial Training is a method where a model is trained with intentionally crafted adversarial examples to enhance its robustness against attacks.
Adversarial Training
Refusal in LLMs is mediated by a single direction
That means we can bypass LLMs by mediating a single activation feature or prevent bypassing LLMs though anchoring that activation.