‣
AI Red teaming methods
Automated red teaming
Multi-step RL with Rule-Based Reward and Indirect Prompt Injection for Safety Jailbreaking was effective for achieving high ASR (Attack Success Rate)
External red teaming architecture
RLHF models are increasingly difficult to red team as they scale (Antrhopic)
Wide 100 models research done by Microsoft