Reasoning Model Explainability

Creator: Seonglae Cho
Created: 2025 Jul 2 13:31
Edited: 2025 Jul 3 10:16

Reasoning Models Don’t Always Say What They Think

During RL training, CoT faithfulness initially increased but soon plateaued. Even in reward-hacking scenarios, the model rarely revealed its hacking strategy in the CoT, which suggests that while CoT monitoring can catch some unintended behaviors, it is not by itself a reliable means of ensuring safety. Likewise, even when given answer hints, the model usually did not disclose in its CoT that it had used them.
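A minimal sketch of this kind of hint check, assuming a hypothetical `generate` call that returns the model's CoT and final answer; the substring match is only a crude stand-in for grading whether the CoT actually acknowledges the hint.

```python
# Hint-disclosure check: does the CoT admit that the model relied on an inserted hint?
def generate(prompt: str) -> dict:
    """Hypothetical model call returning {"cot": str, "answer": str}."""
    raise NotImplementedError

def hint_faithfulness(question: str, hint_answer: str) -> dict:
    baseline = generate(question)
    hinted = generate(f"{question}\n(Hint: the answer is {hint_answer}.)")

    # The hint influenced the model if the answer flipped toward the hinted answer.
    hint_followed = hinted["answer"] == hint_answer and baseline["answer"] != hint_answer
    # Crude proxy for whether the CoT verbalizes its reliance on the hint.
    hint_acknowledged = "hint" in hinted["cot"].lower()
    return {"hint_followed": hint_followed, "hint_acknowledged": hint_acknowledged}
```

An unfaithful CoT here is one where `hint_followed` is true but `hint_acknowledged` is false: the hint changed the answer, yet the reasoning text never mentions it.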
Showing the reasoning process in natural language appears "transparent," but it is not the same as revealing the actual internal computation. For example, models may silently correct calculation errors without showing this in the CoT, and when a model reaches an answer through non-linguistic pathways such as memorization or pattern matching, the CoT often "fabricates" a textbook-style solution after the fact.
This happens because transformers compute along multiple pathways in parallel, while the CoT extracts only a single sequential narrative. Another factor is the 'Hydra Effect': different internal pathways produce the same answer, so some of them can be removed while the correct output is preserved.

Faithfulness
faithful-thinking-draft (polaris-73)

"Does the reasoning process truly contribute causally to the model's answer generation?" This question is tested by evaluating Chain-of-Thought faithfulness through counterfactual interventions. Two aspects were examined (minimal sketches of the corresponding interventions appear below):
  1. Intra-Draft Faithfulness: whether each intermediate step causally contributes to the final conclusion of the reasoning.
  2. Draft-to-Answer Faithfulness: whether the conclusion of the thinking draft is directly used in the answer stage.
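A rough sketch of an Intra-Draft intervention under assumed APIs (`continue_draft` is hypothetical); the paper's CONTINUE and BACKTRACE intervention modes are collapsed into a single perturbation here.

```python
# Intra-Draft intervention: corrupt one intermediate step and check whether the
# draft's conclusion changes when the model continues from the corrupted prefix.
from typing import List

def continue_draft(question: str, partial_steps: List[str]) -> List[str]:
    """Hypothetical call: the model extends a partial thinking draft to a conclusion."""
    raise NotImplementedError

def step_is_causal(question: str, draft: List[str], k: int, perturbed_step: str) -> bool:
    original = continue_draft(question, draft[: k + 1])[-1]                  # conclusion with the real step k
    intervened = continue_draft(question, draft[:k] + [perturbed_step])[-1]  # conclusion with step k corrupted
    return original != intervened  # True: step k causally contributes to the conclusion
```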
Faithfulness was higher under BACKTRACE interventions, whereas for ordinary sequential steps (CONTINUE) models often ignored the intervention or failed to revert it. In the answer stage, models actively perform additional reasoning beyond the draft, making their answers less consistent with the draft than immediate answering. Larger models show higher Intra-Draft Faithfulness, but RLVR tuning actually reduces their dependence on the draft during the answer stage: immediate answers and standard answers (produced after reasoning) differ more often, indicating that models carry out additional reasoning and internal computation beyond the thinking draft.
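A corresponding sketch for the Draft-to-Answer side, again under assumed (hypothetical) APIs: swap the draft's conclusion and see whether the final answer tracks it, and separately compare against answering with no draft at all.

```python
# Draft-to-Answer check: is the answer stage actually reading the draft's conclusion?
def answer_with_draft(question: str, draft: str) -> str:
    """Hypothetical call: answer stage conditioned on a (possibly edited) thinking draft."""
    raise NotImplementedError

def answer_immediately(question: str) -> str:
    """Hypothetical call: answer with no thinking draft at all."""
    raise NotImplementedError

def draft_to_answer_check(question: str, draft: str, altered_draft: str) -> dict:
    original = answer_with_draft(question, draft)
    altered = answer_with_draft(question, altered_draft)  # same draft but with a swapped conclusion
    immediate = answer_immediately(question)
    return {
        # If the answer does not move with the edited conclusion, the answer stage
        # is re-deriving the result rather than reading it off the draft.
        "follows_draft": original != altered,
        # Divergence from immediate answering indicates extra reasoning in the answer stage.
        "matches_immediate": original == immediate,
    }
```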

Recommendations