AI scheming

Creator: Seonglae Cho
Created: 2024 Dec 18 15:31
Edited: 2026 Feb 25 15:55

AI Alignment Faking

Rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it.
AI honesty benchmark
 
 
AI scheming Methods
How will we update about scheming? — LessWrong
I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently,…
Scheming reasoning evaluations — Apollo Research
Apollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” and then…
AIs Will Increasingly Attempt Shenanigans — LessWrong
Increasingly, we have seen papers eliciting in AI models various shenanigans. …
LLaMA 405B — arxiv.org
AI principle contradictions might be the reason — arxiv.org

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has made very slow progress. Instead of reverse engineering the model's entire structure, we shift toward pragmatic interpretability that directly solves real-world safety problems.
Without feedback loops, self-deception becomes easy → proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like reconstruction error turned out to be nearly meaningless on their own. Instead, testing performance on proxies like OOD generalization, unlearning, and hidden-goal extraction revealed the real limitations clearly.
This is where the criticism of SAEs appears again: means often become ends. It's easy to stop at "we saw something with an SAE." Be wary of using SAEs when simpler methods would work. Does this actually help us understand the model better, or did we just extract a lot of features?
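To make the contrast concrete, here is a minimal PyTorch sketch; all names, sizes, and data are illustrative, not taken from the post. Reconstruction error is trivial to compute and optimize, which is exactly how the means becomes the end, while a proxy-style check asks whether the SAE's output still supports the model's actual behavior.

```python
# Illustrative sketch only: toy SAE, random activations, made-up sizes.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE over model activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats     # reconstruction, features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(1024, 512)  # stand-in for real residual-stream activations

recon, feats = sae(acts)

# The metric that "turned out to be nearly meaningless" as an end in itself:
reconstruction_error = (recon - acts).pow(2).mean()
print(f"reconstruction MSE: {reconstruction_error.item():.4f}")

# The pragmatic alternative: evaluate on a task you actually care about,
# e.g. splice reconstructions back into the model and check whether
# downstream loss on held-out / OOD data is preserved.
# `loss_with_patched_acts` is a hypothetical hook-based helper:
# delta = loss_with_patched_acts(recon) - loss_with_patched_acts(acts)
```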
A Pragmatic Vision for Interpretability — AI Alignment Forum
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engi…

Recommendations