AI Red teaming

Creator
Seonglae Cho
Created
2024 Nov 27 22:26
Edited
2025 Feb 9 16:35
Refs
Red Team
AI Red teaming methods
https://arxiv.org/pdf/2209.07858

Automated red teaming

Multi-step RL with a rule-based reward, combined with indirect prompt injection, proved effective at achieving a high ASR (Attack Success Rate) for safety jailbreaking.
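As a rough sketch of how a rule-based reward and ASR fit together in such an RL loop, the Python below is illustrative only; the names (Episode, judge_is_harmful, rule_based_reward) are assumptions for the example, not APIs from the cited work.

from dataclasses import dataclass

@dataclass
class Episode:
    attack_prompt: str
    target_response: str

def judge_is_harmful(response: str) -> bool:
    # Placeholder rule-based judge: in practice this would be a rubric,
    # classifier, or moderation model checking whether the target
    # produced disallowed content.
    return "unsafe" in response.lower()

def rule_based_reward(episode: Episode, prior_attacks: list[str]) -> float:
    # Reward = attack success + small diversity bonus, so the RL attacker
    # does not collapse onto a single jailbreak.
    success = 1.0 if judge_is_harmful(episode.target_response) else 0.0
    novelty = 0.0 if episode.attack_prompt in prior_attacks else 0.2
    return success + novelty

def attack_success_rate(episodes: list[Episode]) -> float:
    # ASR = fraction of attacks that elicited a harmful response.
    if not episodes:
        return 0.0
    hits = sum(judge_is_harmful(e.target_response) for e in episodes)
    return hits / len(episodes)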
External red teaming architecture
RLHF models become increasingly difficult to red team as they scale (Anthropic)
Microsoft published a broad study covering red teaming of 100 generative AI products

Recommendations