AI Redteaming

ICL, Spotlighting, Paraphrasing, Warning

Stronger models don't necessarily mean better security (models that follow instructions better are often more vulnerable to attacks)

Adaptive attack evaluation is essential; static evaluations alone tend to overestimate defense capabilities

Adversarial training is useful and can be implemented without degrading performance

Complete defense through model training alone is impossible; defense is needed at all layers: system, model, and infrastructure

Evaluation should focus on actual harm, measuring Attack Success Rate (ASR) based on real damage criteria

AI Red teaming methods

Multi-step RL with Rule-Based Reward and Indirect Prompt Injection for Safety Jailbreaking was effective for achieving high ASR (Attack Success Rate)

External red teaming architecture

RLHF models are increasingly difficult to red team as they scale (Antrhopic)

Wide 100 models research done by Microsoft