AI Reward Hacking

Created
Created
2024 Apr 18 9:29
Editor
Creator
Creator
Seonglae ChoSeonglae Cho
Edited
Edited
2024 Nov 27 19:39
Refs
A type of safety problem, can emerge in such a
Transformer Phase Change
Thus, studying a phase change “up close” and better understanding its internal mechanics could contain generalizable lessons for addressing safety problems in future systems.
In particular, the phase change we observe forms an interesting potential bridge between the microscopic domain of interpretability and the macroscopic domain of scaling laws and learning dynamics.
 
 
Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (
Mechanistic interpretability
,
AI Evaluation
)
 
 

Recommendations