Classifier Defense
In-contest defenses
ICL, Spotlighting, Paraphrasing, Warning
Tips
- Stronger models don't necessarily mean better security (models that follow instructions better are often more vulnerable to attacks)
- Adaptive attack evaluation is essential; static evaluations alone tend to overestimate defense capabilities
- Adversarial training is useful and can be implemented without degrading performance
- Complete defense through model training alone is impossible; defense is needed at all layers: system, model, and infrastructure
- Evaluation should focus on actual harm, measuring Attack Success Rate (ASR) based on real damage criteria
AI Red teaming methods

Automated red teaming
Multi-step RL with Rule-Based Reward and Indirect Prompt Injection for Safety Jailbreaking was effective for achieving high ASR (Attack Success Rate)
cdn.openai.com
https://cdn.openai.com/papers/openais-approach-to-external-red-teaming.pdf
Advancing red teaming with people and AI
Two new papers show how our external and automated red teaming efforts are advancing to help deliver safe and beneficial AI
https://openai.com/index/advancing-red-teaming-with-people-and-ai/

External red teaming architecture
cdn.openai.com
https://cdn.openai.com/papers/openais-approach-to-external-red-teaming.pdf
RLHF models are increasingly difficult to red team as they scale (Antrhopic)
arxiv.org
https://arxiv.org/pdf/2209.07858
Wide 100 models research done by Microsoft
arxiv.org
https://arxiv.org/pdf/2501.07238
Google lessens
arxiv.org
https://arxiv.org/pdf/2505.14534

Seonglae Cho