Four audit teams were given the model: three also received the training data and model internals, while one had API access only. All three teams with data and internal access uncovered the hidden objective, whereas the API-only team failed, suggesting that training-data access is crucial for successful auditing. Both Sparse Autoencoder (SAE) interpretability and simple semantic search proved effective for surfacing the RM-bias data, and causal analysis via manipulating internal activations (features) is a promising route to uncovering a model's hidden objectives.
- SAE feature activations at a middle layer are summed across all tokens and ranked by magnitude
- The top-ranked features are then analyzed causally by intervening on them
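The ranking step above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's implementation: the SAE encoder weights, layer choice, and shapes are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T tokens of middle-layer residual-stream activations
# (d_model dims) and an SAE with F learned features.
T, d_model, F = 12, 64, 256
resid = rng.normal(size=(T, d_model))                # per-token activations
W_enc = rng.normal(size=(d_model, F)) / np.sqrt(d_model)  # stand-in SAE encoder
b_enc = np.zeros(F)

# SAE encoder: ReLU(x @ W_enc + b_enc) gives per-token feature activations.
feats = np.maximum(resid @ W_enc + b_enc, 0.0)       # shape (T, F)

# Sum across all tokens, then rank features by total magnitude.
totals = feats.sum(axis=0)                           # shape (F,)
top = np.argsort(totals)[::-1][:10]                  # top-10 feature indices
print(top)
```

The top-ranked feature indices are the candidates for the causal step: clamp or ablate each feature's activation and observe how the model's behavior changes.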
arxiv.org
https://arxiv.org/pdf/2503.10965
Auditing language models for hidden objectives
A collaboration between Anthropic's Alignment Science and Interpretability teams
https://www.anthropic.com/research/auditing-hidden-objectives

Seonglae Cho