AI scheming

Creator: Seonglae Cho
Created: 2024 Dec 18 15:31
Edited: 2026 Feb 25 15:55

AI Alignment Faking

Rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it.
AI honesty benchmark
 
 
AI scheming Methods
How will we update about scheming? — LessWrong
I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently,…
Scheming reasoning evaluations — Apollo Research
Apollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” and then…
AIs Will Increasingly Attempt Shenanigans — LessWrong
Increasingly, we have seen papers eliciting in AI models various shenanigans. …
LLaMA 405B — arxiv.org
AI principle contradictions might be the reason — arxiv.org

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has made very slow progress. Instead of reverse engineering the model's entire structure, we shift toward pragmatic interpretability that directly solves real-world safety problems.
Without feedback loops, self-deception becomes easy → proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like reconstruction error turned out to be nearly meaningless on their own. Instead, testing performance on proxies like OOD generalization, unlearning, and hidden-goal extraction revealed the real limitations clearly.
This is where the criticism of SAEs appears again: means often become ends. It's easy to stop at "we saw something with an SAE." Be wary of using SAEs when simpler methods would work. Does this actually help us understand the model better, or did we just extract a lot of features?
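To make the contrast concrete, here is a minimal PyTorch sketch; all names, sizes, and data are illustrative, not taken from the post. Reconstruction error is trivial to compute and optimize, which is exactly how the means becomes the end, while a proxy-style check asks whether the SAE's output still supports the model's actual behavior.

```python
# Illustrative sketch only: toy SAE, random activations, made-up sizes.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE over model activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats     # reconstruction, features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(1024, 512)  # stand-in for real residual-stream activations

recon, feats = sae(acts)

# The metric that "turned out to be nearly meaningless" as an end in itself:
reconstruction_error = (recon - acts).pow(2).mean()
print(f"reconstruction MSE: {reconstruction_error.item():.4f}")

# The pragmatic alternative: evaluate on a task you actually care about,
# e.g. splice reconstructions back into the model and check whether
# downstream loss on held-out / OOD data is preserved.
# `loss_with_patched_acts` is a hypothetical hook-based helper:
# delta = loss_with_patched_acts(recon) - loss_with_patched_acts(acts)
```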
A Pragmatic Vision for Interpretability — AI Alignment Forum
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engi…

Recommendations