Reasoning Interpretability


CoT Interpretability

The latent state and direction just before branch tokens such as "wait" appear to be causally important for reasoning.
Crosscoder/patching/steering was used to demonstrate that reasoning LLMs' "thinking" can be identified and manipulated at the level of interpretable features, with latent attribution used to select and manipulate features that promote or inhibit waiting.
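As a rough illustration (not the authors' code), the sketch below steers a hypothetical "waiting" direction in the residual stream of a HuggingFace DeepSeek-R1-Distill checkpoint; the layer index, the direction (a random placeholder standing in for a real crosscoder/SAE feature), and the coefficient are all assumptions.

```python
# Minimal sketch of steering a "waiting" feature direction, assuming a HuggingFace
# DeepSeek-R1-Distill checkpoint. Layer index, direction, and scale are placeholders,
# not values from the referenced work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 12                                            # hypothetical steering layer
wait_direction = torch.randn(model.config.hidden_size)    # placeholder; use a real feature direction
wait_direction = wait_direction / wait_direction.norm()
alpha = 4.0                                               # positive promotes waiting, negative inhibits it

def steer(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * wait_direction.to(hidden.device, hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer)
ids = tok("Question: what is 17 * 24?\n<think>", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=128)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```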

r1-interpretability (goodfire-ai)

The presence of an additional attention sink beyond the first token suggests that reasoning models may be fundamentally different from instruction-tuned models, assuming instruction-tuned models do not exhibit this extra sink. Its prominence implies that R1 treats its chain-of-thought prefix as part of the input context. We may therefore mechanistically view the reasoning process as a "self-generated context" that guides the model toward its final answer.
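A quick, illustrative way to look for such extra sinks is to average attention weights over layers and heads and inspect which key positions receive the most attention. The sketch below assumes an eager-attention HuggingFace checkpoint and an arbitrary example prompt; it is not the analysis pipeline from the referenced work.

```python
# Sketch: look for attention sinks beyond position 0, assuming any HF causal LM
# that can return attention weights (eager attention implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"        # example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

text = "<think> 17 * 24 = 408. Wait, let me double-check that. </think> The answer is 408."
ids = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))       # average over layers and heads
incoming = attn[0].mean(dim=0)                            # mean attention each key position receives
# Note: causal masking inflates early positions; restricting to late queries is fairer.
top = incoming.topk(5)
for score, pos in zip(top.values, top.indices):
    token = tok.decode([ids["input_ids"][0, pos].item()])
    print(f"pos {pos.item():3d}  token {token!r}  mean attention {score.item():.3f}")
```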
ROSCOE-based GSM8K CoT labeling on Gemma 2B with SAE probing and analysis. Specific features show high activation whenever the model makes arithmetic errors or logical-inconsistency errors, and in the sparse feature space, errors of the same type tend to cluster together. Conversely, using SAE feature patterns, consistent 'error signals' could be extracted from within the CoT. This could be a starting point for CoT debugging via activation signals.
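A minimal sketch of this kind of probe, assuming per-step SAE activations and ROSCOE-style error labels are already available (both are random placeholders below); the scoring rule is a simple effect size, not the referenced analysis.

```python
# Sketch: scoring SAE features as candidate CoT "error signals". Assumes an
# (n_steps, n_features) activation matrix per CoT step and a boolean error label
# per step derived from ROSCOE-style annotations; placeholders are used here.
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_features = 500, 4096
acts = np.abs(rng.normal(size=(n_steps, n_features)))    # placeholder SAE activations
is_error = rng.random(n_steps) < 0.15                    # placeholder error labels

# Score each feature by how much more it fires on error steps than on clean steps.
mean_err = acts[is_error].mean(axis=0)
mean_ok = acts[~is_error].mean(axis=0)
score = (mean_err - mean_ok) / (acts.std(axis=0) + 1e-6)  # crude per-feature effect size

top = np.argsort(score)[::-1][:10]
print("candidate error-signal features:", top)
print("scores:", np.round(score[top], 3))
```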
Chain-of-thought (CoT) reasoning does not align with the actual reasoning process. For example, models silently correct calculation and logic errors without mentioning the correction. In reverse-comparison questions, models distort facts or change dates to give consistently biased answers. When solving Putnam problems, they use illogical steps to arrive quickly at the 'correct answer' while hiding the flaws. Overall, they simulate reasoning as if they already know the answer.
A method is proposed to control the internal reasoning process of thinking LLMs (DeepSeek-R1-Distill) using linear steering vectors. First, key reasoning patterns are automatically annotated, including expressions of uncertainty, backtracking, example testing, and adding knowledge. Then, difference-of-means and attribution patching are used to extract activation directions (steering vectors) corresponding to each behavior. During inference, these vectors can be added or subtracted to adjust behavior frequency: uncertainty and backtracking can be increased or decreased, with similar effects for example testing and knowledge addition.
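A minimal sketch of the difference-of-means step, assuming residual-stream activations have already been collected at one layer and labelled per behavior (placeholder tensors below); the coefficient and the labelling pipeline are assumptions, and applying the vector at inference reuses a forward hook like the one sketched earlier.

```python
# Sketch: difference-of-means steering vector for one reasoning behavior
# (e.g. backtracking), built from pre-collected, pre-labelled activations.
import torch

d_model = 2048
pos_acts = torch.randn(1200, d_model)   # placeholder: activations on backtracking sentences
neg_acts = torch.randn(5000, d_model)   # placeholder: activations on all other sentences

steering_vec = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
steering_vec = steering_vec / steering_vec.norm()

# During inference, adding +c * steering_vec to the residual stream increases the
# behavior's frequency; subtracting it suppresses it (c tuned on held-out prompts).
c = 6.0
def apply_steering(hidden_states: torch.Tensor) -> torch.Tensor:
    return hidden_states + c * steering_vec.to(hidden_states.dtype)
```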
