Classifier Defense
In-contest defenses
ICL, Spotlighting, Paraphrasing, Warning
Tips
- Stronger models don't necessarily mean better security (models that follow instructions better are often more vulnerable to attacks)
- Adaptive attack evaluation is essential; static evaluations alone tend to overestimate defense capabilities
- Adversarial training is useful and can be implemented without degrading performance
- Complete defense through model training alone is impossible; defense is needed at all layers: system, model, and infrastructure
- Evaluation should focus on actual harm, measuring Attack Success Rate (ASR) based on real damage criteria
AI Red teaming methods

Automated red teaming
Multi-step RL with Rule-Based Reward and Indirect Prompt Injection for Safety Jailbreaking was effective for achieving high ASR (Attack Success Rate)
External red teaming architecture
RLHF models are increasingly difficult to red team as they scale (Antrhopic)
Wide 100 models research done by Microsoft