Is there a dataset of worked solution steps?
Pure mech interpretability research
reasoning model dual attention sink
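How could things like induction heads and copying heads be put to use? Poking at the heads carefully could be fun.
The view that the residual space behaves like Riemannian geometry rather than a Euclidean space. Mathematically, this goes only as far as noting that forms such as "rotation / spherical interpolation (slerp)" and "norm-preserving update" are common patterns in other fields.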
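A minimal sketch of what "slerp" and a "norm-preserving update" mean for a single vector, assuming plain Python lists as stand-ins for residual-stream vectors. The function names `slerp` and `norm_preserving_update` are hypothetical and not taken from any library or paper.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def slerp(u, v, t):
    """Spherical interpolation between the unit directions of u and v, t in [0, 1]."""
    nu, nv = norm(u), norm(v)
    a = [x / nu for x in u]
    b = [x / nv for x in v]
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)  # angle between the two directions
    if theta < 1e-8:
        return list(a)  # nearly parallel: nothing to rotate
    if theta > math.pi - 1e-8:
        raise ValueError("slerp is not uniquely defined for antiparallel vectors")
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def norm_preserving_update(h, delta, t):
    """Rotate h toward the direction of delta while keeping |h| fixed (assumed update rule)."""
    r = norm(h)
    return [r * x for x in slerp(h, delta, t)]

h = [3.0, 0.0]
delta = [0.0, 1.0]
h_new = norm_preserving_update(h, delta, 0.5)
```

Here `h_new` points halfway between the two directions but keeps the original length of `h`, unlike the Euclidean update `h + delta`, which would change the norm.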
Topic
How about PDEs and ODEs: an existing dataset, or a synthetic dataset generated by solvers and Wolfram Alpha
Train a model and look for grokking, deep double descent, etc.
tokenize in specific way
Wolfram Beta: is the name funny?
On top of that, study the training mechanism through attention head types, and how a PDE/ODE-solving agent generalizes or standardizes the process. Also phase changes and in-context learning.
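The synthetic-dataset idea under Topic could be sketched roughly as below, assuming the simplest possible family (linear ODEs y' = a·y) and a hand-written RK4 integrator standing in for a real solver or Wolfram Alpha. The record layout, the step strings, and the digit-level `tokenize` scheme are all hypothetical choices made for illustration.

```python
import math
import random

def rk4_solve(f, y0, t0, t1, steps=100):
    """Integrate y' = f(t, y) from t0 to t1 with the classical RK4 scheme."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y

def make_example(rng):
    """One problem y' = a*y, y(0) = y0 with worked steps and a numeric answer."""
    a = rng.randint(-3, 3)
    y0 = rng.randint(1, 5)
    t1 = rng.randint(1, 2)
    numeric = rk4_solve(lambda t, y: a * y, float(y0), 0.0, float(t1))
    return {
        "problem": f"y' = {a} y , y(0) = {y0} , find y({t1})",
        "steps": [
            f"separate: dy/y = {a} dt",
            f"integrate: ln|y| = {a} t + C",
            f"apply y(0) = {y0}: y = {y0} exp({a} t)",
        ],
        "answer": f"{numeric:.4f}",
        "closed_form": y0 * math.exp(a * t1),  # kept only to sanity-check the solver
    }

def tokenize(text):
    """Hypothetical scheme: split on spaces, then split numbers into single characters."""
    tokens = []
    for piece in text.split():
        if piece.lstrip("-").replace(".", "").isdigit():
            tokens.extend(piece)  # "-12" -> "-", "1", "2"
        else:
            tokens.append(piece)
    return tokens

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(5)]
```

Digit-level splitting is only one option for "tokenize in a specific way"; the point of the sketch is that the solver output and the worked steps can be generated together and checked against a closed form.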
Evolutional Pretraining
Gradient Routing
Use Gradient Routing (SGTM) for layer-wise evolutionary data curation: pretrain the model with gradient routing, like a primitive brain
from easy data (physical world) to hard data (logical)
MLP Interpretability
A paper explaining the internal mechanism of the Grokking phenomenon in small neural networks learning modular addition through Fourier features + lottery ticket structure + phase alignment process. What the model actually learns: when a two-layer neural network solves modular addition, each neuron learns a single-frequency Fourier feature. In other words, it solves the problem by transforming it into a periodic signal decomposition problem rather than arithmetic. Previous research only discovered that "neurons learn frequencies," but this paper explains how those features are combined into a complete algorithm and why generalization suddenly occurs. Modular addition is special because it can be completely expressed with Fourier bases, making it possible to precisely analyze the internal mechanism, which is why it was chosen as a toy model.
After the memorization phase, phase alignment aligns the frequencies' phases, causing the entire structure to operate like a single algorithm. Then grokking occurs with an explosion in generalization performance. In other words, Grokking is not about feature discovery but rather about alignment or composition of already-discovered features.
Similar to the Lottery Ticket Hypothesis, there already exists a subnetwork within the network that can implement the correct algorithm. Learning is the process of "activating" that structure.
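To make the Fourier picture concrete, here is a minimal sketch of modular addition solved by summing single-frequency cosine features. It assumes a prime modulus and hand-picked frequencies; it is an illustration of the general idea, not the trained network or the exact construction from the paper. The `phases` argument is a hypothetical knob to show why phase alignment matters.

```python
import math

def fourier_logits(a, b, p, freqs, phases=None):
    """Logit for each candidate c: sum over k of cos(2*pi*k*(a + b - c)/p + phase_k)."""
    phases = phases or [0.0] * len(freqs)
    return [
        sum(
            math.cos(2 * math.pi * k * (a + b - c) / p + phi)
            for k, phi in zip(freqs, phases)
        )
        for c in range(p)
    ]

def predict(a, b, p, freqs, phases=None):
    """Pick the candidate with the largest logit."""
    logits = fourier_logits(a, b, p, freqs, phases)
    return max(range(p), key=lambda c: logits[c])

p = 13             # prime modulus (assumption of this sketch)
freqs = [1, 3, 5]  # a few frequencies, one per hypothetical neuron group

# With aligned phases every cosine peaks at c = (a + b) mod p.
aligned_ok = all(
    predict(a, b, p, freqs) == (a + b) % p for a in range(p) for b in range(p)
)

# With every phase shifted by pi the same features vote against the right answer.
shifted = [math.pi] * len(freqs)
misaligned_ok = any(
    predict(a, b, p, freqs, shifted) == (a + b) % p
    for a in range(p)
    for b in range(p)
)
```

The features are identical in both runs; only their phases differ, which is the sense in which alignment of already-present features, rather than discovery of new ones, decides whether the answer comes out right.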
https://arxiv.org/pdf/2602.16849
Transformers can perform reasoning even when token meanings are not fixed. In settings where meaning cannot be stored in embeddings, they learn symbolic, relation-based algorithms in-context. In other words, they can infer relationships between tokens without memorizing token meanings. LLMs can create "temporary meanings (dynamic variables)" from context and reason without fixed token semantics.
When training small transformers from scratch (pretraining), models achieve near-perfect accuracy even when token embeddings don't have "hardcoded" values, and generalize well to groups not seen during training (e.g., order-8 groups).
The model learns 3-4 main mechanisms:
- Copying: A head that copies identical facts seen previously (certain heads strongly specialize in this).
- Commutative copying: If ab=c exists, also copy ba=c (when applicable).
- Identity recognition: Detects facts involving the identity element and selects "the answer is the other variable" (combination of 'query promotion' + 'identity demotion').
- Closure-based cancellation: Tracks candidates belonging to the same group (closure), and eliminates impossible answers using shared slot (left/right) facts to leave a unique answer.
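The four mechanisms above can be written out as an explicit symbolic procedure. This is a hand-coded sketch of the rules as described in these notes, assuming in-context facts are triples `(x, y, z)` meaning xy = z; it is not the model's learned circuit, and the function name `predict` is hypothetical.

```python
def predict(facts, query, commutative=True):
    """Answer query (x, y) from in-context facts (a, b, c), each meaning a*b = c."""
    x, y = query
    table = {(a, b): c for a, b, c in facts}

    # 1. Copying: the exact fact was already seen.
    if (x, y) in table:
        return table[(x, y)], "copy"

    # 2. Commutative copying: reuse ba = c for ab (only valid if the group commutes).
    if commutative and (y, x) in table:
        return table[(y, x)], "commutative copy"

    # 3. Identity recognition: a*b = b marks a as identity, a*b = a marks b.
    identities = {a for a, b, c in facts if b == c} | {b for a, b, c in facts if a == c}
    if x in identities:
        return y, "identity"
    if y in identities:
        return x, "identity"

    # 4. Closure-based cancellation: candidates are the symbols seen in context;
    #    a fact sharing exactly one slot with the query rules out its result.
    closure = {s for fact in facts for s in fact}
    ruled_out = {c for a, b, c in facts if (a == x) != (b == y)}
    remaining = closure - ruled_out
    if len(remaining) == 1:
        return remaining.pop(), "cancellation"
    return None, "unknown"

# Z_3 under arbitrary symbol names: p = 0, q = 1, r = 2.
facts = [("q", "q", "r"), ("q", "r", "p")]
answer = predict(facts, ("q", "p"))
```

In the example, `q q = r` and `q r = p` share the left slot with the query `q p`, so `r` and `p` are eliminated and only `q` remains.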
Using activation patching/indirect effect analysis, we can identify which attention heads/subspaces implement each function. Copying is especially dominated by almost a single head.
Phase change: During training, loss curves show discrete phase transitions where skills emerge in order: structural tokens ('=', ',') → closure → copying/commute → identity and cancellation → (finally) some associativity.
https://arxiv.org/pdf/2512.16902
https://chatgpt.com/c/698a16e2-4028-8388-a178-a528f815b75b

2025 In-context learning Jianliang He ICL_linear Y-Agent • Updated 2026 Feb 25 15:57
Multi-head softmax attention learns to internally implement a "debiased gradient descent" algorithm when performing in-context linear regression on linear data. Attention converges to a specific structure (not by chance). During training, the weights of each head align to the following patterns:
- KQ (Key-Query): diagonal-matrix form before the superposition regime; decides which samples to look at (based on x)
- OV (Output-Value): only the last term remains; retrieves only the y value from that sample
In other words, one head = a machine that averages y based on x similarity. This is why it appears as kernel regression / nearest neighbor.
- Heads split into positive/negative groups (positive vs negative heads)
- OV sum is nearly 0 (zero-sum)
In other words, attention is organized into a circuit that implements a specific algorithm rather than being an arbitrary function. Multi-head = sum of kernel regressions. Each head computes a similarity-weighted average of y. That is, one head = kernel regressor.
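The "one head = kernel regressor" reading can be sketched in a few lines. This assumes the diagonal KQ matrix collapses to a single scalar `omega` and that OV reads out only y, which is a simplification of the structure described in these notes rather than the paper's exact parameterization.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(xs, ys, x_query, omega):
    """One softmax head: weights from x-similarity (KQ), output is a weighted mean of y (OV)."""
    scores = [omega * sum(a * b for a, b in zip(x, x_query)) for x in xs]
    weights = softmax(scores)
    return sum(w * y for w, y in zip(weights, ys))

xs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ys = [2.0, 0.0, -2.0]
x_query = [1.0, 0.0]

flat = attention_head(xs, ys, x_query, omega=0.0)    # uniform average of y
sharp = attention_head(xs, ys, x_query, omega=50.0)  # close to nearest neighbor
```

With `omega = 0` the head returns the plain mean of the y values, and as `omega` grows it approaches the y of the most similar sample, which is the kernel regression / nearest neighbor behavior described above.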
However, when two or more heads are introduced, the algorithm suddenly changes. 1 head → non-parametric kernel regression (slow, inefficient). 2+ heads → approximates the gradient descent predictor. Multi-head attention ≈ one GD update step, in other words, equivalent to the result of one-step GD on the training set. At a high level, this suggests that the Transformer discovers and implements the GD algorithm during training.
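For reference, the textbook one-step GD predictor for in-context linear regression can be written out as follows. The notation is an assumption of this sketch and not taken from the paper: n in-context pairs (x_i, y_i), query x_q, squared loss, zero initialization w_0 = 0, and step size η.

```latex
\mathcal{L}(w) = \frac{1}{2n}\sum_{i=1}^{n}\left(w^\top x_i - y_i\right)^2,
\qquad
w_1 = w_0 - \eta\,\nabla\mathcal{L}(w_0)\Big|_{w_0 = 0}
    = \frac{\eta}{n}\sum_{i=1}^{n} y_i\,x_i,
\qquad
\hat{y}_q = w_1^\top x_q = \frac{\eta}{n}\sum_{i=1}^{n} y_i\,x_i^\top x_q .
```

The prediction is a sum of the y_i weighted linearly by the similarity x_i^T x_q, in contrast to the softmax-normalized weights of a single head.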
Why are positive/negative heads necessary? The two heads cancel each other out in order to remove bias (debiased GD), improve performance, and approach Bayes optimal. positive head − negative head = gradient descent update (debiased GD approximation)
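A small numerical check of the "positive head − negative head" claim, under strong simplifying assumptions: each head has a scalar KQ weight (+omega and −omega) and a scalar OV weight (+mu and −mu, so the OV sum is zero). The comparison target is a one-step GD predictor with mean-centered similarities; that target follows from a first-order Taylor expansion of the softmax worked out for this sketch, not from the paper's derivation, and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head(xs, ys, x_query, omega):
    """Softmax head with scalar KQ weight omega, reading out y."""
    weights = softmax([omega * dot(x, x_query) for x in xs])
    return sum(w * y for w, y in zip(weights, ys))

def two_head_prediction(xs, ys, x_query, omega, mu):
    """Positive head (+omega, +mu) plus negative head (-omega, -mu)."""
    return mu * head(xs, ys, x_query, omega) - mu * head(xs, ys, x_query, -omega)

def debiased_gd_prediction(xs, ys, x_query, eta):
    """One-step GD predictor with the similarity scores centered (mean removed)."""
    n = len(xs)
    sims = [dot(x, x_query) for x in xs]
    mean_sim = sum(sims) / n
    return eta / n * sum(y * (s - mean_sim) for y, s in zip(ys, sims))

xs = [[1.0, 0.5], [-0.3, 0.8], [0.2, -1.0], [0.7, 0.1]]
ys = [1.2, 0.1, -0.9, 0.6]
x_query = [0.5, -0.4]

omega = 1e-3
mu = 1.0 / (2.0 * omega)  # chosen so the effective step size 2*mu*omega equals 1
attn = two_head_prediction(xs, ys, x_query, omega, mu)
gd = debiased_gd_prediction(xs, ys, x_query, eta=1.0)
```

For small `omega` the uniform part of each softmax cancels between the two heads, leaving a term linear in the centered similarities, so `attn` and `gd` agree closely. The agreement degrades as `omega` grows and the softmax leaves its linear regime.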
It also learns the data distribution:
- isotropic → general GD
- anisotropic → preconditioned GD (reflects the data covariance)
- multi-task → heads are distributed across tasks (superposition)
In other words, attention even converts the statistical structure of the data into internal algorithms. Put differently, the claim is that the Transformer's ICL ability is not "pattern recognition" but "algorithm implementation." Multi-head softmax attention ≈ debiased gradient descent solver. However, these results are derived from a highly simplified linear task, leaving open whether similar algorithmic structures emerge in realistic multi-layer transformers trained on natural language.
https://arxiv.org/pdf/2503.12734
Seonglae Cho