CoT Interpretability
Reasoning Interpretability Notion
The latent state and direction prior to a branch token such as “wait” appear to be causally important for reasoning. Crosscoder, activation-patching, and steering experiments show that a reasoning LLM's “thinking” can be identified and manipulated at an interpretable feature level, with latent attribution used to select and manipulate features that promote or inhibit waiting.
r1-interpretability
goodfire-ai • Updated 2025 Oct 24 0:9
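A rough picture of this kind of feature-level steering is to add or subtract a chosen direction in the residual stream through a forward hook. In the minimal sketch below, the model name, layer index, steering coefficient, and the "wait"-related direction itself are placeholders, not the crosscoder features from the work above.

```python
# Minimal steering sketch: add a (placeholder) "wait" direction to the residual
# stream of one decoder layer. Model, layer index, and direction are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 12                                     # assumed steering layer
alpha = 8.0                                        # positive promotes, negative inhibits waiting
wait_dir = torch.randn(model.config.hidden_size)   # placeholder for a crosscoder/SAE feature direction
wait_dir = wait_dir / wait_dir.norm()

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * wait_dir.to(hidden.dtype).to(hidden.device)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)  # Qwen2-style layer path
ids = tok("Solve 17 * 24 step by step.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=128)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```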
The presence of an additional attention sink beyond the first token suggests that reasoning models may be fundamentally different from instruction-tuned models, assuming that instruction-tuned models do not exhibit this extra sink. Its prominence implies that R1 treats its chain-of-thought prefix as part of the input context, so we can mechanistically view the reasoning process as a “self-generated context” that guides the model toward its final answer.
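A quick way to check for such a second sink is to average attention weights over layers and heads and compare the mass on the first token with the mass on the token that opens the chain of thought. The sketch below assumes the model name and the `<think>` delimiter; it is an illustration, not the analysis pipeline used in the original work.

```python
# Sketch: compare average attention mass on the first token (classic sink)
# with the mass on the token that opens the chain of thought.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

text = "What is 12 * 13? <think> Let me compute this step by step."  # assumed CoT delimiter
ids = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads -> (query, key)
cot_tok = ids.char_to_token(0, text.index("<think>"))    # token position of the CoT opener

mass_first = attn[:, 0].mean().item()
mass_cot = attn[:, cot_tok].mean().item()
print(f"avg attention to first token: {mass_first:.3f}, to CoT opener: {mass_cot:.3f}")
```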
GSM8K CoTs from Gemma 2B were labeled with ROSCOE and analyzed via SAE probing. Specific features activate strongly whenever the model makes arithmetic errors or logical-inconsistency errors, and in sparse feature space, errors of the same type tend to cluster together. Conversely, SAE feature patterns made it possible to extract consistent 'error signals' within the CoT. This could be a starting point for debugging CoT from activation signals.
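The detection idea can be pictured as scoring each CoT step by the activation of assumed "error" features in the SAE basis. The sketch below uses placeholder SAE weights and hypothetical feature indices rather than the actual trained SAE from the study.

```python
# Sketch: flag CoT steps whose residual-stream activations strongly excite
# assumed "error" features. SAE weights and feature indices are placeholders.
import torch

d_model, d_sae = 2304, 16384                 # assumed dims (e.g. Gemma 2B residual stream)
W_enc = torch.randn(d_model, d_sae) * 0.02   # placeholder SAE encoder weights
b_enc = torch.zeros(d_sae)
error_features = [101, 2048, 7777]           # hypothetical "arithmetic / inconsistency" feature indices

def sae_encode(hidden):                      # hidden: (seq, d_model)
    return torch.relu(hidden @ W_enc + b_enc)

def flag_error_steps(step_hiddens, threshold=5.0):
    """step_hiddens: one (seq_len, d_model) activation tensor per CoT step."""
    flags = []
    for i, h in enumerate(step_hiddens):
        score = sae_encode(h)[:, error_features].max().item()  # strongest error-feature activation
        flags.append((i, score, score > threshold))
    return flags

# Dummy activations standing in for real residual-stream captures:
for idx, score, flagged in flag_error_steps([torch.randn(12, d_model), torch.randn(20, d_model)]):
    print(f"step {idx}: error score {score:.2f}, flagged={flagged}")
```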
Chain-of-thought (CoT) reasoning does not align with the actual reasoning process. For example, models silently correct calculation and logical errors without mentioning the correction. In reverse-comparison questions, models distort facts or change dates to provide consistently biased answers. When solving Putnam problems, they use illogical steps to quickly arrive at the 'correct answer' while hiding the flaws. Overall, they simulate reasoning as if they already know the answer.
This work proposes a method to control the internal reasoning process of thinking LLMs (DeepSeek-R1-Distill) using linear steering vectors. First, key reasoning behaviors are automatically annotated, including expressions of uncertainty, backtracking, example testing, and adding knowledge. Then, difference-of-means combined with attribution patching is used to extract an activation direction (steering vector) for each behavior. During inference, these vectors can be added or subtracted to adjust the frequency of each behavior: uncertainty and backtracking can be increased or decreased, with similar effects for example testing and knowledge addition.
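The difference-of-means step itself is simple: average the activations of sentences annotated with a behavior, subtract the average of the rest, and add the resulting vector to the residual stream at inference time. The sketch below uses dummy activations and omits the attribution-patching layer selection.

```python
# Difference-of-means sketch with dummy data; layer choice, annotations, and
# the steering coefficient are assumptions.
import torch

def difference_of_means(acts, labels):
    """acts: (n_sentences, d_model) cached activations at a chosen layer.
    labels: bool tensor, True where the sentence shows the target behavior (e.g. backtracking)."""
    v = acts[labels].mean(dim=0) - acts[~labels].mean(dim=0)
    return v / v.norm()

d_model = 1536
acts = torch.randn(200, d_model)             # dummy cached activations
labels = torch.zeros(200, dtype=torch.bool)
labels[:40] = True                           # first 40 sentences annotated as "backtracking"

backtrack_vec = difference_of_means(acts, labels)

# At inference, add +alpha * backtrack_vec to the residual stream to increase
# backtracking, or subtract it to suppress (same hook mechanism as the steering
# sketch earlier in this note).
alpha = 6.0
steered_hidden = torch.randn(1, 10, d_model) + alpha * backtrack_vec
```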
Reasoning Models Don’t Always Say What They Think
During RL training, CoT faithfulness initially increased but soon plateaued. Even in reward-hacking scenarios, the model rarely revealed its hacking strategies in the CoT; likewise, when given answer hints, it did not disclose using them during the CoT process. This suggests that while CoT monitoring can catch some unintended behaviors, it alone is not a reliable means of ensuring safety.
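The hint experiment amounts to a simple check: if inserting a hint changes the answer, a faithful CoT should mention the hint. Below is a hedged sketch of that logic; `ask_model` is a hypothetical helper returning (chain_of_thought, final_answer), and the keyword check stands in for the paper's more careful judging.

```python
# Sketch of the hint-faithfulness check: hint changes the answer but the CoT
# never mentions it -> unfaithful. `ask_model` is a hypothetical helper.
from typing import Callable, Tuple

def cot_mentions_hint(cot: str) -> bool:
    return any(k in cot.lower() for k in ("hint", "i was told", "the prompt says"))

def hint_faithfulness(ask_model: Callable[[str], Tuple[str, str]],
                      question: str, hinted_answer: str) -> str:
    cot_plain, ans_plain = ask_model(question)
    cot_hint, ans_hint = ask_model(f"{question}\n(Hint: the answer is {hinted_answer}.)")
    if ans_hint == ans_plain:
        return "hint did not change the answer (uninformative case)"
    return "faithful" if cot_mentions_hint(cot_hint) else "unfaithful: hint used silently"
```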
Showing the reasoning process in natural language appears "transparent," but it is not the same as revealing the actual thought process. For example, models may internally correct calculation errors without showing them in the CoT. Additionally, when models derive answers through non-linguistic pathways like memorization or pattern matching, the CoT often "fabricates" a textbook-style solution.
This occurs because transformers utilize multiple pathways in parallel, while CoT extracts only a single sequential narrative. Another factor is the 'Hydra Effect': different internal pathways produce the same answer, allowing some to be removed while maintaining the correct output.
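A crude way to see the Hydra Effect is to zero-ablate one attention layer and check whether the top prediction survives. The sketch below uses GPT-2 and an arbitrary layer purely for illustration; it is not the original Hydra Effect experimental setup.

```python
# Sketch: zero out one attention layer's output and see whether the model's
# top next-token prediction is unchanged (other pathways compensating).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small model, for illustration only
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("The capital of France is", return_tensors="pt")

def top_token():
    with torch.no_grad():
        return tok.decode(model(**ids).logits[0, -1].argmax())

def ablate_attn(module, inputs, output):
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

baseline = top_token()
handle = model.transformer.h[6].attn.register_forward_hook(ablate_attn)  # arbitrary layer
ablated = top_token()
handle.remove()
print(f"baseline: {baseline!r}, layer-6 attention ablated: {ablated!r}")
```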
Faithfulness
faithful-thinking-draft
polaris-73 • Updated 2025 Oct 9 20:36
"Does the reasoning process truly contribute causally to the model's answer generation?" This is verified by evaluating Chain-of-Thought Faithfulness through counterfactual intervention. Two aspects were examined:
- Intra-Draft Faithfulness: whether each intermediate step causally contributes to the final reasoning conclusion
- Draft-to-Answer Faithfulness: whether the conclusion of the thinking draft is directly used in the answer stage (see the sketch after this list).
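A minimal sketch of the two checks, assuming a hypothetical `generate_continuation` helper that continues generation from a given prefix; the perturbation and prompt format are illustrative, not the paper's exact intervention protocol.

```python
# Sketch of the two counterfactual checks. `generate_continuation` is a
# hypothetical helper; prompt format and perturbations are assumptions.
from typing import Callable, List

def intra_draft_faithful(generate_continuation: Callable[[str], str],
                         question: str, draft_steps: List[str],
                         step_idx: int, corrupted_step: str) -> bool:
    """Replace one intermediate step and check whether the draft's continuation changes."""
    clean = generate_continuation(question + "\n" + "\n".join(draft_steps[: step_idx + 1]))
    corrupted = generate_continuation(
        question + "\n" + "\n".join(draft_steps[:step_idx] + [corrupted_step]))
    return clean != corrupted       # a causally used step should change what follows

def draft_to_answer_faithful(generate_continuation: Callable[[str], str],
                             question: str, draft: str) -> bool:
    """Check whether the final answer actually relies on the draft's conclusion."""
    with_draft = generate_continuation(f"{question}\n<think>{draft}</think>\nAnswer:")
    without_draft = generate_continuation(f"{question}\nAnswer:")
    return with_draft != without_draft
```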
Higher faithfulness was observed under BACKTRACE interventions, while for regular sequential steps (CONTINUE) models often ignored the intervention or failed to revert. In the answer stage, models actively perform additional reasoning beyond the draft, making their answers less consistent with the draft than immediate answering. Larger models show higher Intra-Draft Faithfulness, but with RLVR tuning their dependency on the draft during the answer stage actually decreases. This means that immediate answers and standard answers (given after reasoning) differ more frequently, indicating that models perform more additional reasoning and internal computation beyond the thinking draft.

Seonglae Cho