AI Reward Hacking

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2024 Apr 18 9:29
Editor
Edited
Edited
2025 Mar 17 0:27
Refs
A type of safety problem, can emerge in such a
Transformer Phase Change
Thus, studying a phase change “up close” and better understanding its internal mechanics could contain generalizable lessons for addressing safety problems in future systems.
In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.
 
 
Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (
Mechanistic interpretability
,
AI Evaluation
)
 
 

Recommendations