Reward Misgeneralization

Creator
Seonglae Cho
Created
2025 Mar 23 21:20
Edited
2025 Mar 23 21:21

Inappropriate generalization of reward signals leading to inaccurate decisions

A phenomenon in which the reward function learned during training generalizes incorrectly to new situations, producing unintended behavior that deviates from the original objective.
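The definition above can be made concrete with a small toy sketch. Everything in it is an assumption made for illustration and does not come from a specific paper or library: the two-feature state `(task_done, steps_right)`, the linear reward model, and the names `fit_reward_model`, `finish_task`, and `run_right`. During training, finishing the task always coincides with having moved two steps to the right, so the learned reward cannot tell the two features apart. Once they decouple at deployment, the learned reward prefers the spurious behavior.

```python
# Toy illustration (hypothetical setup, not a standard benchmark):
# a linear reward model is fit on states where the true cause of reward
# (task_done) and a spurious feature (steps_right) always co-occur.

def true_reward(state):
    """Ground-truth objective: reward only when the task is done."""
    task_done, _steps_right = state
    return 1.0 if task_done else 0.0

def learned_reward(weights, state):
    """Linear reward model: dot product of weights and state features."""
    return sum(w * x for w, x in zip(weights, state))

def fit_reward_model(states, lr=0.05, epochs=100):
    """Fit the linear model to the true reward with per-sample gradient descent."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for state in states:
            error = learned_reward(weights, state) - true_reward(state)
            # Gradient of the squared error (error ** 2) w.r.t. each weight.
            weights = [w - lr * 2 * error * x for w, x in zip(weights, state)]
    return weights

# Training distribution: finishing the task always means 2 steps to the right.
train_states = [(1, 2), (0, 0)]
weights = fit_reward_model(train_states)

# Deployment: the two features no longer move together.
candidates = {"finish_task": (1, 0), "run_right": (0, 2)}
chosen = max(candidates, key=lambda name: learned_reward(weights, candidates[name]))

print("learned weights:", weights)
print("chosen behavior:", chosen)
print("true reward of chosen behavior:", true_reward(candidates[chosen]))
```

In this sketch the model fits the training data perfectly, yet it places more weight on `steps_right` than on `task_done`, so the behavior it scores highest at deployment earns zero true reward. Real cases involve far richer models and environments, but the failure has the same shape: the training data underdetermines which feature the reward should track.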
