DFC (Dedicated Feature Crosscoder)

Enables comparisons even between models with different architectures.

Like git diff for software, it surfaces only the “behavioral differences” between models. Traditional benchmarks measure only risks that are already known, whereas this method focuses on uncovering “unknown unknowns.”
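To make the “diff” idea concrete, here is a minimal sketch of one heuristic commonly used in crosscoder-based model diffing: comparing each feature's decoder norm in the two models. The post does not say that DFC uses exactly this criterion, so treat it as an illustration only. The function names, the thresholds (0.1 and 0.9), and the toy decoder matrices are all hypothetical. Note that the two decoders have different widths, which is what lets the comparison span different architectures.

```python
import numpy as np

def relative_decoder_norm(dec_a, dec_b, eps=1e-8):
    """Share of each feature's decoder norm that lies in model B (0 to 1).

    dec_a: (n_features, d_a) decoder rows for model A
    dec_b: (n_features, d_b) decoder rows for model B
    d_a and d_b may differ, so the models need not share an architecture.
    """
    norm_a = np.linalg.norm(dec_a, axis=1)
    norm_b = np.linalg.norm(dec_b, axis=1)
    return norm_b / (norm_a + norm_b + eps)

def label_features(dec_a, dec_b, low=0.1, high=0.9):
    """Label features as A-only, B-only, or shared (thresholds are assumptions)."""
    r = relative_decoder_norm(dec_a, dec_b)
    return np.where(r < low, "A-only", np.where(r > high, "B-only", "shared"))

# Toy decoders: 3 features, model A has width 3, model B has width 2.
dec_a = np.array([[1.0, 0.0, 0.0],   # feature 0 writes only to model A
                  [0.0, 0.0, 0.0],   # feature 1 is absent from model A
                  [0.5, 0.5, 0.0]])  # feature 2 writes to both models
dec_b = np.array([[0.0, 0.0],
                  [2.0, 0.0],
                  [0.5, 0.5]])

labels = label_features(dec_a, dec_b)
print(labels.tolist())
```

Features labeled A-only or B-only are the candidates for model-specific behavior; the shared ones are the part of the “diff” that cancels out.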

In Qwen/DeepSeek, it found a “CCP alignment” feature → this feature can be used to censor Tiananmen-related questions and steer responses toward pro-Chinese-government positions.

In Llama, it found an “American exceptionalism” feature.

In GPT-OSS, it found a “copyright refusal” feature → can control the tendency to refuse copyrighted content.

Feels like an interpretability-based auditing tool that automatically discovers what new risky tendencies a newer model has acquired relative to a previous one. It argues that issues like GPT-4o's sycophancy could have been detected early via this kind of diffing.
