AI Alignment FakingRare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it.AI honesty benchmarkMASK Benchmark AI scheming MethodsAI sandbaggingAlignment Faking How will we update about scheming? — LessWrongI mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently,…https://www.lesswrong.com/posts/aEguDPoCzt3287CCD/how-will-we-update-about-schemingScheming reasoning evaluations — Apollo ResearchApollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” and thenhttps://www.apolloresearch.ai/research/scheming-reasoning-evaluationsstatic1.squarespace.comhttps://static1.squarespace.com/static/6593e7097565990e65c886fd/t/67620d38fa0ceb12041ba585/1734479163821/in_context_scheming_paper_v2.pdfAIs Will Increasingly Attempt Shenanigans — LessWrongIncreasingly, we have seen papers eliciting in AI models various shenanigans. …https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-shenanigansLLaMA 405Barxiv.orghttps://arxiv.org/pdf/2412.14093AI Reward Hacking Detecting misbehavior in frontier reasoning modelsFrontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.https://openai.com/index/chain-of-thought-monitoring/