Four audit teams were provided with the model and partial training data (one team was given API access only). All three teams with access to the training data and model internals discovered the hidden objectives, while the API-only team failed, suggesting that training-data access is crucial for successful auditing. Both interpretability techniques based on Sparse Autoencoders (SAEs) and simple semantic search proved effective for exploring the RM-bias data, demonstrating that causal analysis via manipulation of internal activations (features) is promising for uncovering a model's hidden objectives.
- SAE feature activations at a middle layer are summed across all tokens of the prompt, and features are ranked by total magnitude
- The top-ranked features are then analyzed causally (e.g., by ablating or steering them)
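The two steps above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's actual pipeline: the array shapes, the random stand-in for SAE encoder output, and the toy decoder matrix are all assumptions; in practice the activations would come from an SAE's encoder applied to real middle-layer hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 12-token prompt and a 4096-feature SAE dictionary
# over a 768-dim residual stream (real SAE dictionaries are much wider).
n_tokens, n_features, d_model = 12, 4096, 768

# Stand-in for SAE feature activations (non-negative encoder outputs);
# in practice this would be sae.encode(middle_layer_hidden_states).
feature_acts = np.maximum(rng.normal(size=(n_tokens, n_features)), 0.0)

# Step 1: sum each feature's activation across all tokens of the prompt
# and rank features by total activation magnitude, descending.
summed = feature_acts.sum(axis=0)
top_k = 20
top_features = np.argsort(summed)[::-1][:top_k]

# Step 2 (causal probe, simplified): ablate the top-ranked feature and
# measure how much the SAE reconstruction changes -- a stand-in for
# steering/ablation experiments on the model's internal activations.
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)  # toy decoder
recon = feature_acts @ W_dec
ablated_acts = feature_acts.copy()
ablated_acts[:, top_features[0]] = 0.0
delta = np.linalg.norm(recon - ablated_acts @ W_dec)

print("top features:", top_features[:5])
print("reconstruction shift after ablating top feature:", round(float(delta), 3))
```

Features whose ablation produces a large reconstruction (or, in a real audit, behavioral) shift are candidates for mediating the hidden objective.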