Induced incentive
- The goal of LLMs is not language itself; it is something that was induced
- ChatGPT is a community platform where the public participates in aligning AGI
Observation
- Computing cost is decreasing exponentially
- Inducing intelligence from a lower-level substrate through an induced incentive requires more computing
- A low-level substrate (the transformer) serves a high-level incentive structure (intelligence)
- Unlike humans, machines operate on a different time budget
Loss is a bottom-up approach to inducing AI functionality, while the reward function is a top-down approach to deducing AI features. In reinforcement learning, as in human evolution, feedback (natural selection in the biological case) drives the acquisition of features like self-replication and survival. In contrast, AI mimics such features through metrics like loss. At least so far, RLHF-style reinforcement learning for large models is used only for alignment, not for creating new features grounded in mechanistic interpretability.
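To make the contrast concrete, here is a toy sketch (all names and numbers are hypothetical, not any particular library's API): a loss gives dense, per-token feedback that induces behavior from the bottom up, while an RLHF-style preference reward gives sparse, whole-response feedback used to align behavior the model already has.

```python
import math

def cross_entropy_loss(predicted_probs, target_index):
    """Bottom-up: a dense, per-token loss that shapes the model
    by penalizing every wrong prediction."""
    return -math.log(predicted_probs[target_index])

def preference_reward(score_a, score_b):
    """Top-down: a sparse, whole-response signal (as in RLHF),
    saying only which complete answer a human preferred."""
    return 1.0 if score_a > score_b else 0.0

# Next-token prediction gets feedback at every position...
probs = [0.1, 0.7, 0.2]                                # toy distribution over a 3-token vocabulary
print(cross_entropy_loss(probs, target_index=1))       # ~0.357

# ...while preference feedback only compares two finished responses.
print(preference_reward(score_a=0.8, score_b=0.3))     # 1.0
```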
Future architecture
Just as every value in society is ultimately converted into money, all rewards in the brain are reduced to dopamine, i.e. happiness. Similarly, in AI models this is currently modeled as a reward or a loss, mostly future prediction. However, this structure is too simple and too easy to reward-hack, and in an era where AI agents and Fast Weights are being introduced, AI could also incorporate more biological incentives, such as running time, additional computation cost, or self-replication.
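As a minimal sketch of that idea (the function name and cost weights below are illustrative assumptions, not a proposal from any existing system), running time and compute cost can be folded into a single scalar alongside the prediction reward:

```python
def composite_reward(prediction_reward, flops_used, seconds_elapsed,
                     flops_cost=1e-12, time_cost=0.01):
    """Combine the task reward with penalties for computation and latency.
    The cost weights are illustrative, not tuned values."""
    return prediction_reward - flops_cost * flops_used - time_cost * seconds_elapsed

# Example step: a correct prediction that spent 3 GFLOPs and 0.25 s of wall-clock time.
print(composite_reward(prediction_reward=1.0, flops_used=3e9, seconds_elapsed=0.25))
# 1.0 - 0.003 - 0.0025 = 0.9945
```

An agent optimized against such a signal is pushed not only to predict well but also to economize on its own compute, which is closer to the time and energy budget constraint noted above.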
Teaching a man how to fish is more valuable than just giving him a fish. Better still, teach him the taste of fish and make him hungry, rather than merely teaching him how to fish.
One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior.
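A minimal sketch of that direction (toy data and hypothetical names throughout): instead of hand-writing a goal function, fit a reward model to pairwise human preferences, in the Bradley-Terry style used by RLHF.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy "human": prefers whichever trajectory has the larger feature sum.
def human_prefers_a(a, b):
    return sum(a) > sum(b)

# Linear reward model r(x) = w . x, trained so preferred trajectories score higher.
w = [0.0, 0.0]
learning_rate = 0.1
for _ in range(2000):
    a = [random.uniform(-1, 1) for _ in range(2)]
    b = [random.uniform(-1, 1) for _ in range(2)]
    winner, loser = (a, b) if human_prefers_a(a, b) else (b, a)
    # Bradley-Terry: probability the model assigns to the human's choice.
    margin = sum(wi * (xi - yi) for wi, xi, yi in zip(w, winner, loser))
    p = sigmoid(margin)
    # Gradient ascent on log p.
    for i in range(2):
        w[i] += learning_rate * (1.0 - p) * (winner[i] - loser[i])

print(w)  # both weights end up positive: the learned reward tracks the hidden preference
```

Both learned weights end up positive, i.e. the reward model recovers the hidden preference without anyone writing the goal function down explicitly.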