DFC

Creator
Seonglae Cho
Created
2026 May 22 15:20
Editor
Edited
2026 May 22 15:24
Refs

Dedicated Feature Crosscoder

  • Enables comparisons even between models with different architectures.
  • Like git diff for software, it surfaces only the behavioral differences between models. Traditional benchmarks measure only known risks, whereas this method aims to uncover “unknown unknowns.”
  • In Qwen and DeepSeek, it found a “CCP alignment” feature that can censor Tiananmen-related questions and steer responses toward pro-Chinese-government positions.
  • In Llama, it found an “American exceptionalism” feature.
  • In GPT-OSS, it found a “copyright refusal” feature that can control the tendency to refuse requests involving copyrighted content.
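
The "can steer" and "can control" claims above suggest some form of feature steering. The notes do not say how the steering is done, so the following is only a minimal sketch under a common assumption: a feature is represented by a direction in activation space, and steering adds a scaled copy of that direction to an activation vector. The names `steer`, `projection`, and `strength` are hypothetical and chosen for illustration.

```python
from math import sqrt
from typing import List, Sequence


def _norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return sqrt(sum(x * x for x in v))


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto a feature direction."""
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("feature direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / n


def steer(activation: Sequence[float], direction: Sequence[float],
          strength: float) -> List[float]:
    """Return activation + strength * unit(direction).

    Assumption: steering is plain additive intervention along the
    feature's direction. Positive strength amplifies the feature,
    negative strength suppresses it.
    """
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [a + strength * d / n for a, d in zip(activation, direction)]


# Toy example: a 3-dimensional activation and a made-up feature direction.
activation = [1.0, 0.0, 2.0]
refusal_direction = [0.0, 3.0, 4.0]  # hypothetical "copyright refusal" direction

amplified = steer(activation, refusal_direction, strength=2.0)
# Subtracting the current projection removes the feature's component entirely.
suppressed = steer(activation, refusal_direction,
                   strength=-projection(activation, refusal_direction))
```

In this sketch, the projection onto the feature direction shifts by exactly `strength`, which is what makes the intervention easy to reason about; a real setup would apply it to a model's hidden states at a chosen layer.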
It reads like an interpretability-based auditing tool that automatically discovers which new risky tendencies a newer model has acquired relative to an earlier one. The argument is that issues like GPT-4o's AI sycophancy could have been detected early via this kind of diffing.
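
To make the "diff" idea concrete, here is a rough sketch of how crosscoder features are often sorted into shared and model-specific groups: compare the norm of each feature's decoder vector for the older model against its decoder vector for the newer model. This relative-norm heuristic is an assumption borrowed from general crosscoder analysis, not something these notes confirm for DFC; the name suggests DFC may instead dedicate features to one model by construction. The function names and the `margin` threshold are hypothetical. Because only norms are compared, the two decoders may have different dimensions, which fits the cross-architecture claim.

```python
from math import sqrt
from typing import Dict, List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a decoder vector."""
    return sqrt(sum(x * x for x in v))


def classify_feature(dec_old: Sequence[float], dec_new: Sequence[float],
                     margin: float = 0.1) -> str:
    """Label one feature by its relative decoder norm.

    r = ||dec_new|| / (||dec_old|| + ||dec_new||)
    r near 0 -> only the old model uses the feature
    r near 1 -> only the new model uses the feature
    otherwise -> shared between both models
    """
    n_old, n_new = l2_norm(dec_old), l2_norm(dec_new)
    total = n_old + n_new
    if total == 0.0:
        return "dead"  # neither model writes this feature
    r = n_new / total
    if r <= margin:
        return "old_only"
    if r >= 1.0 - margin:
        return "new_only"
    return "shared"


def diff_features(decoders_old: Sequence[Sequence[float]],
                  decoders_new: Sequence[Sequence[float]],
                  margin: float = 0.1) -> Dict[str, List[int]]:
    """Group feature indices by label; 'new_only' is the audit shortlist."""
    groups: Dict[str, List[int]] = {
        "old_only": [], "new_only": [], "shared": [], "dead": []}
    for i, (a, b) in enumerate(zip(decoders_old, decoders_new)):
        groups[classify_feature(a, b, margin)].append(i)
    return groups


# Toy decoders: the old model has 2-dim activations, the new one 3-dim.
decoders_old = [[1.0, 0.0], [0.0, 0.0], [0.6, 0.8], [0.0, 0.0]]
decoders_new = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

groups = diff_features(decoders_old, decoders_new)
```

Under these assumptions, the `new_only` group is the shortlist an auditor would inspect first, since it holds behavior the newer model has and the older one lacks.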
