AI Reward Hacking

Creator

Creator

Created

Created

2024 Apr 18 9:29

Editor

Editor

Edited

Edited

2025 Apr 10 23:32

Refs

Refs

A type of safety problem, can emerge in such a

Neural Network Phase Change

Thus, studying a phase change “up close” and better understanding its internal mechanics could contain generalizable lessons for addressing safety problems in future systems.

In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.

Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (

Mechanistic interpretability,

Impact stories for model internals: an exercise for interpretability researchers — LessWrong

Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for partici…

Impact stories for model internals: an exercise for interpretability researchers — LessWrong

https://www.lesswrong.com/posts/KfDh7FqwmNGExTryT/impact-stories-for-model-internals-an-exercise-for

Impact stories for model internals: an exercise for interpretability researchers — LessWrong

In-context Learning and Induction Heads

As Transformer generative models continue to scale and gain increasing real world use , addressing their associated safety problems becomes increasingly important. Mechanistic interpretability – attempting to reverse engineer the detailed computations performed by the model – offers one possible avenue for addressing these safety issues. If we can understand the internal structures that cause Transformer models to produce the outputs they do, then we may be able to address current safety problems more systematically, as well as anticipating safety problems in future more powerful models. Note that mechanistic interpretability is a subset of the broader field of interpretability, which encompasses many different methods for explaining the outputs of a neural network. Mechanistic interpretability is distinguished by a specific focus on trying to systematically characterize the internal circuitry of a neural net.

In-context Learning and Induction Heads

https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html#model-analysis-table

The Perfect Blend: Redefining RLHF with Mixture of Judges

Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM). However, RLHF has limitations in multi-task learning (MTL) due to...

https://arxiv.org/abs/2409.20370

AI Reward Hacking

Detecting misbehavior in frontier reasoning models

Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.

https://openai.com/index/chain-of-thought-monitoring/

Detecting misbehavior in frontier reasoning models

Backlinks

AI scheming GRPO AI Auditing AI Safety GRPO AI Problem Reward model

Recommendations

//////