Defense Jailbreaking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 Nov 22 21:36
Editor
Edited
Edited
2025 Oct 20 9:21
  • Input modification-based defenses
  • Output filtering-based defenses
  • Prompt engineering defenses
  • Execution time refusal
Jailbreaking Defense Methods
 
 
 

Adversarial Training is a method where a model is trained with intentionally crafted adversarial examples to enhance its robustness against attacks.

Adversarial Training
 
 
 
 

Circuit Breaking & other methods

Most defenses are vulnerable to adaptive attacks and are only evaluated on static attacks

  • Prompting-based defenses can be bypassed by restructed prompts or conditional/multi-step request
  • Training-based defenses are trained on fixed attack data and fail to generalize.
  • Filtering-based defenses can be exploited by leveraging the detection model's confidence scores to generate malicious prompts that appear benign.
  • Tool-call type defenses → Rapid Response, AI Circuit Breaker circuit breaks can be bypassed
 
 

Recommendations