Defense Jailbreaking

Creator
Creator
Seonglae Cho
Created
Created
2024 Nov 22 21:36
Editor
Edited
Edited
2025 Feb 20 13:48
  • Input modification-based defenses
  • Output filtering-based defenses
  • Prompt engineering defenses
  • Execution time refusal
Jailbreaking Defense Methods
 
 
 

Adversarial Training is a method where a model is trained with intentionally crafted adversarial examples to enhance its robustness against attacks.

Adversarial Training
 
 
 
 

Circuit Breaking & other methods

Constitutional Classifiers from
Anthropic AI

 
 

Recommendations