Dedicated Feature Crosscoder
- Enables comparisons even between models with different architectures.
- Like git diff for software, it isolates only the “behavioral differences” between models. Traditional benchmarks measure only known risks, whereas this method focuses on uncovering “unknown unknowns.”
- In Qwen/DeepSeek, it found a “CCP alignment” feature → can censor Tiananmen-related questions and steer responses toward pro-Chinese-government positions.
- In Llama, it found an “American exceptionalism” feature.
- In GPT-OSS, it found a “copyright refusal” feature → can control the tendency to refuse copyrighted content.
This reads like an interpretability-based auditing tool that automatically discovers which new risky tendencies a newer model has acquired relative to an earlier one. The work argues that issues such as GPT-4o's sycophancy could have been detected early with this kind of diffing.
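
To make the "diff" idea concrete, here is a minimal sketch of one way crosscoder features could be sorted into model-specific and shared groups. It assumes a relative decoder-norm heuristic: a feature whose decoder weights are near zero for one model is treated as exclusive to the other. The notes above do not say this is the method's actual criterion, so treat it as an assumption. The function names, thresholds, and toy weights are all hypothetical.

```python
import math


def l2_norm(row):
    """Euclidean norm of one decoder row."""
    return math.sqrt(sum(x * x for x in row))


def classify_features(dec_a, dec_b, low=0.1, high=0.9):
    """Label each crosscoder feature as "A-only", "B-only", or "shared".

    dec_a[i] and dec_b[i] are the decoder rows of feature i into model A
    and model B. The rows may have different lengths (different hidden
    sizes), because only the per-feature norms are compared.
    The thresholds `low` and `high` are illustrative, not from the source.
    """
    results = []
    for row_a, row_b in zip(dec_a, dec_b):
        norm_a, norm_b = l2_norm(row_a), l2_norm(row_b)
        total = norm_a + norm_b
        # Relative norm: 0.0 means the feature writes only into A,
        # 1.0 means only into B. Dead features fall back to 0.5.
        rel = norm_b / total if total > 0 else 0.5
        if rel < low:
            label = "A-only"
        elif rel > high:
            label = "B-only"
        else:
            label = "shared"
        results.append((rel, label))
    return results


# Toy decoders: model A has hidden size 3, model B has hidden size 2.
toy_dec_a = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
toy_dec_b = [[0.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

for index, (rel, label) in enumerate(classify_features(toy_dec_a, toy_dec_b)):
    print(index, round(rel, 2), label)
```

In an audit, the features labeled "B-only" for the newer model would be the candidates to inspect by hand, analogous to the added lines in a git diff.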

Seonglae Cho