Truncating Belief-Trapped Trajectories
During active reasoning, the model’s internal belief can drift away from the true state, a failure called belief deviation. This deviation can push the model into a belief-trap region (BTR), where it produces uninformative or repetitive actions. During reinforcement learning (RL), that tail in turn contaminates credit assignment for the early, useful part of the trajectory.
To address this, we propose T3 (Truncating Belief-Trapped Trajectories), a method that detects when a trajectory enters a belief-trap region and cuts it off there. T3 uses proxy signals to detect BTR entry and truncates the training trajectory at that point, so the information-free tail cannot distort the gradient signal. Concretely, in the return computation of GAE (Generalized Advantage Estimation), T3 prevents uninformative $\delta_u$ terms from accumulating, which improves the stability of policy optimization.
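The effect of truncation on the advantage computation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names (`truncate_at_trap`, `gae_advantages`, `t3_advantages`) are hypothetical, the proxy signal is reduced to a precomputed list of per-step boolean flags, the flagged step itself is dropped along with everything after it, and the value after the last kept step is treated as 0.

```python
from typing import List, Sequence


def truncate_at_trap(trap_flags: Sequence[bool]) -> int:
    """Return how many leading steps to keep.

    Assumption: trap_flags[t] is True once a (hypothetical) proxy signal
    says step t lies inside the belief-trap region. The first flagged
    step and everything after it are dropped.
    """
    for t, flagged in enumerate(trap_flags):
        if flagged:
            return t
    return len(trap_flags)


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over a finite trajectory.

    Assumption: the value after the final step is 0 (no bootstrapping).
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        # TD residual for step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted sum of this residual and all later ones.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def t3_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    trap_flags: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute GAE only on the prefix before the first trap flag."""
    keep = truncate_at_trap(trap_flags)
    return gae_advantages(rewards[:keep], values[:keep], gamma, lam)
```

Because the backward recursion sums every later TD residual into each earlier advantage, dropping the tail before running it is what keeps the trapped steps from leaking into the estimates for the useful prefix. Whether to bootstrap from a value estimate at the cut point instead of using 0 is a design choice this sketch does not settle.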
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
Active reasoning requires large language model (LLM) agents to interact with external sources and strategically gather information to solve problems in multiple turns. Central to this process is...
https://arxiv.org/abs/2510.12264v2


Seonglae Cho