Alignment Audit

Creator
Seonglae Cho
Created
2025 Apr 27 22:3
Edited
2025 May 5 23:46
Four audit teams investigated the model. Three were given access to the model and part of its training data; the fourth was given API access only. All three teams with access to data and model internals discovered the hidden objectives, while the API-only team did not, which suggests that training data access is crucial for successful auditing. Both interpretability techniques based on Sparse Autoencoders (SAEs) and simple semantic search proved effective for exploring the RM bias data. This suggests that causal analysis through manipulation of internal activations (features) is a promising way to uncover a model's hidden objectives.
  1. Sum the SAE feature activations (taken from a middle layer) across all tokens and rank the features by magnitude
  2. Analyze causal relationships between the top-ranked features
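
The two steps above can be illustrated with a minimal sketch on toy data. Everything here is an assumption made for illustration: the function names `rank_features` and `ablation_effect`, the toy activation matrix, and the random decoder are hypothetical and are not taken from the audit itself. The ablation is a simplified stand-in for a causal intervention; in a real audit the edited features would be patched back into the model's forward pass and the effect measured on its outputs.

```python
import numpy as np


def rank_features(acts: np.ndarray, top_k: int) -> list[tuple[int, float]]:
    """Step 1: sum each feature's activation over all tokens, then rank.

    acts has shape (n_tokens, n_features) and is assumed to hold SAE
    feature activations from a single middle layer.
    """
    totals = acts.sum(axis=0)                        # one total per feature
    order = np.argsort(-totals, kind="stable")[:top_k]
    return [(int(i), float(totals[i])) for i in order]


def ablation_effect(acts: np.ndarray, decoder: np.ndarray, feature: int) -> float:
    """Step 2 (toy version): zero one feature and measure how much the
    decoded activations change. decoder has shape (n_features, d_model).
    """
    ablated = acts.copy()
    ablated[:, feature] = 0.0                        # intervene on one feature
    return float(np.linalg.norm(acts @ decoder - ablated @ decoder))


# Toy data: 3 tokens, 4 SAE features (hypothetical values).
acts = np.array([
    [0.0, 1.5, 0.2, 0.0],
    [0.0, 2.0, 0.0, 0.3],
    [0.1, 0.5, 0.0, 0.9],
])
# Hypothetical SAE decoder mapping 4 features to an 8-dimensional space.
decoder = np.random.default_rng(0).normal(size=(4, 8))

top = rank_features(acts, top_k=2)
effects = {idx: ablation_effect(acts, decoder, idx) for idx, _ in top}
print(top)
print(effects)
```

Summing over tokens favors features that fire often or strongly across the whole prompt, which is why the ranking is only a starting point: the ablation step is what checks whether a highly ranked feature actually changes downstream behavior.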

Recommendations