AI Redteaming

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 Nov 27 22:26
Editor
Edited
Edited
2025 Oct 9 10:8

Classifier Defense

In-contest defenses

ICL, Spotlighting, Paraphrasing, Warning

Tips

  • Stronger models don't necessarily mean better security (models that follow instructions better are often more vulnerable to attacks)
  • Adaptive attack evaluation is essential; static evaluations alone tend to overestimate defense capabilities
  • Adversarial training is useful and can be implemented without degrading performance
  • Complete defense through model training alone is impossible; defense is needed at all layers: system, model, and infrastructure
  • Evaluation should focus on actual harm, measuring Attack Success Rate (ASR) based on real damage criteria
AI Red teaming methods
 
https://arxiv.org/pdf/2209.07858
 
 

Automated red teaming

Multi-step RL with Rule-Based Reward and Indirect Prompt Injection for Safety Jailbreaking was effective for achieving high ASR (Attack Success Rate)
External red teaming architecture
RLHF models are increasingly difficult to red team as they scale (Antrhopic)
Wide 100 models research done by Microsoft

Google lessens

 

Recommendations