Analyze the model's internal workings to evaluate the process itself rather than the final output, preventing reward errors. (Mechanistic interpretability, AI Evaluation)
Internal mechanistic evaluation not final
Creator
Creator
Seonglae ChoCreated
Created
2024 Nov 18 20:50Editor
Editor
Seonglae ChoEdited
Edited
2024 Nov 22 23:37Refs
Refs