All RLHF-like language model RL methods include a KL-divergence penalty against a frozen reference policy to prevent AI Reward Hacking.
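A minimal sketch of that KL-shaped per-token reward, following the common PPO-RLHF convention of adding the sequence-level reward on the final token (the function name and reward placement are illustrative, not a specific library's API):

```python
import torch

def kl_penalized_reward(reward: float,
                        logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Shape the reward with a KL penalty toward the frozen reference policy.

    reward:       scalar sequence-level score from the reward model
    logprobs:     log pi_theta(y_t | x, y_<t) for each generated token
    ref_logprobs: log pi_ref(y_t | x, y_<t) from the frozen reference model
    beta:         KL coefficient; larger values keep the policy closer to pi_ref
    """
    # Per-token estimate of KL(pi_theta || pi_ref)
    kl = logprobs - ref_logprobs
    # Penalize divergence so the policy cannot drift into reward-hacking regions
    per_token = -beta * kl
    # Common convention: add the sequence reward on the final token
    per_token[-1] += reward
    return per_token
```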
Language Model RL Methods
Language Model Reward Benchmarks
Reinforcement Learning Transformers
Language Model RL Frameworks
Era of Experience
However, it has low sample efficiency at large sampling budgets and only refines reasoning paths already within the base model's capacity, which makes its total problem-solving coverage (measured as pass@k at large k) smaller.
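The coverage claim is usually quantified with pass@k; a minimal sketch of the standard unbiased estimator from Chen et al. (2021), with illustrative numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples drawn per problem
    c: number of correct samples among them
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that has collapsed onto fewer distinct solution paths may win at
# k=1 yet fall below the base model's coverage at large k.
print(pass_at_k(n=200, c=3, k=1))    # 0.015
print(pass_at_k(n=200, c=3, k=100))  # ~0.877
```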
While the provocative title is not entirely accurate, it offers insight that extends even to Multimodality.
The appendix is an excellent reference for Language Model RL.
Agent RL vulnerability
Search LLMs trained with agentic RL may appear safe, but they can be easily jailbroken by manipulating the timing of the search step. The RL objective itself fails to suppress harmful queries, making "search-first" behavior a critical vulnerability.

Seonglae Cho