Defense Jailbreaking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 Nov 22 21:36
Editor
Edited
Edited
2026 Feb 10 18:0
  • Input modification-based defenses
  • Output filtering-based defenses
  • Prompt engineering defenses
  • Execution time refusal
Jailbreaking Defense Methods
 
 
 

Adversarial Training is a method where a model is trained with intentionally crafted adversarial examples to enhance its robustness against attacks.

Adversarial Training
 
 
 
 

Circuit Breaking & other methods

arxiv.org

Most defenses are vulnerable to adaptive attacks and are only evaluated on static attacks

  • Prompting-based defenses can be bypassed by restructed prompts or conditional/multi-step request
  • Training-based defenses are trained on fixed attack data and fail to generalize.
  • Filtering-based defenses can be exploited by leveraging the detection model's confidence scores to generate malicious prompts that appear benign.
  • Tool-call type defenses → Rapid Response, AI Circuit Breaker circuit breaks can be bypassed
The Attacker Moves Second: Stronger Adaptive Attacks Bypass...
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or...
The Attacker Moves Second: Stronger Adaptive Attacks Bypass...
 
 

Recommendations