T3

Creator
Seonglae Cho
Created
2026 May 21 10:23
Edited
2026 May 21 10:25

Truncating Belief-Trapped Trajectories

During active reasoning, the model's internal belief can drift away from the true state of the problem (belief deviation). Enough drift pushes the model into a belief-trap region (BTR), where it produces meaningless or repetitive actions. In reinforcement learning (RL), these uninformative tail steps contaminate credit assignment for the early, useful part of the trajectory.
To address this, the paper proposes T3 (Truncating Belief-Trapped Trajectories), a method that detects when a trajectory enters a belief-trap region and cuts it off there. Using proxy signals to detect BTR entry, T3 truncates the training trajectory at that point, so the information-free tail cannot distort the gradient signal. Concretely, in the GAE (Generalized Advantage Estimation) computation, T3 prevents meaningless $\delta_u$ terms from the tail from accumulating into the advantages of earlier steps, which improves the stability of policy optimization.
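The effect of truncation on GAE can be illustrated with a minimal sketch. It is not the paper's implementation: the repetition-based detector `first_trap_step`, its `patience` parameter, and the choice to bootstrap from the value estimate at the cut are all assumptions made for illustration. The GAE recursion itself is the standard one, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.

```python
from typing import List, Sequence, Tuple


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE. `values` has len(rewards) + 1 entries; the last one
    is the bootstrap value for the state after the final step."""
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running  # accumulates later deltas
        advantages[t] = running
    return advantages


def first_trap_step(actions: Sequence[str], patience: int = 2) -> int:
    """Hypothetical proxy signal: return the first step at which the same
    action has been repeated more than `patience` times in a row, or
    len(actions) if that never happens."""
    run = 1
    for t in range(1, len(actions)):
        run = run + 1 if actions[t] == actions[t - 1] else 1
        if run > patience:
            return t
    return len(actions)


def truncated_gae(rewards: Sequence[float], values: Sequence[float],
                  actions: Sequence[str], gamma: float = 0.99,
                  lam: float = 0.95, patience: int = 2
                  ) -> Tuple[List[float], int]:
    """Cut the trajectory at the detected trap step, then run GAE on the
    kept prefix only, so tail deltas never enter earlier advantages."""
    cut = first_trap_step(actions, patience)
    kept_rewards = list(rewards[:cut])
    # Assumption: bootstrap from the critic's estimate at the cut state.
    kept_values = list(values[:cut + 1])
    return gae_advantages(kept_rewards, kept_values, gamma, lam), cut


if __name__ == "__main__":
    actions = ["ask_a", "ask_b", "ask_c", "ask_c", "ask_c", "ask_c"]
    rewards = [1.0, 1.0, 0.0, 0.0, -1.0, -1.0]  # tail steps are penalized
    values = [0.0] * 7
    print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
    print(truncated_gae(rewards, values, actions, gamma=1.0, lam=1.0))
```

In this toy trajectory the first two steps earn reward, yet without truncation the negative tail deltas cancel them out in the advantage of step 0. After the cut, the early steps keep their positive advantage.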
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
Active reasoning requires large language model (LLM) agents to interact with external sources and strategically gather information to solve problems in multiple turns. Central to this process is...