Test-Time Training

Creator: Seonglae Cho
Created: 2025 Oct 14 22:27
Edited: 2026 Jan 15 18:21

TTT-E2E: Test-Time Training, End-to-End

Instead of carrying the entire long context in the KV cache, TTT-E2E reads the context while continuously updating (compressing it into) part of the model weights through next-token prediction, with sliding-window attention (SWA) handling local context. At test time the context is thus compressed into the weights by self-supervised next-token learning; for stability, only the MLP layers are updated by TTT.
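A minimal sketch of this inner loop, assuming an HF-style causal LM whose forward returns `.logits` and whose MLP parameter names contain "mlp" (both assumptions; the real module names depend on the architecture, and this is not the paper's code). The chunked loop stands in for the sliding window: only local attention is computed per chunk, so long-range information must flow through the updated MLP weights.

```python
import torch
import torch.nn.functional as F

def ttt_inner_loop(model, context_ids, chunk_len=2048, lr=1e-4):
    # Freeze everything, then unfreeze only the MLP parameters
    # ("mlp" in the name is an assumption about the architecture).
    for name, p in model.named_parameters():
        p.requires_grad_("mlp" in name)
    mlp_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(mlp_params, lr=lr)

    # Read the context chunk by chunk, compressing it into the MLP
    # weights with a self-supervised next-token loss.
    for start in range(0, context_ids.size(1) - 1, chunk_len):
        chunk = context_ids[:, start : start + chunk_len + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits = model(inputs).logits
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # one gradient step per chunk
    return model
```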
At 128K context this is 2.7× faster than full attention, and at 2M context, 35× faster. The meta-learning stage, however, is slow because it requires higher-order derivatives. There is an outer loop (training time) and an inner loop (test time): the outer-loop loss has the structure W₀ → gradient update → W₁ → loss, so backpropagating it involves second-order gradients, where the result of the inner gradient computation is itself differentiated again. A toy version of this structure is sketched below.
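A runnable toy illustration of the W₀ → gradient step → W₁ → loss structure (a minimal sketch with a single linear layer, not the paper's implementation). `create_graph=True` keeps the inner gradient in the autograd graph, so `outer_loss.backward()` computes the second-order term that makes meta-training slow.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
w0 = torch.randn(8, 8, requires_grad=True)      # meta-learned initialization
meta_opt = torch.optim.Adam([w0], lr=1e-3)
inner_lr = 0.1

x_in, y_in = torch.randn(4, 8), torch.randint(0, 8, (4,))    # inner-loop chunk
x_out, y_out = torch.randn(4, 8), torch.randint(0, 8, (4,))  # outer-loop targets

# Inner loop (the test-time step, made differentiable for meta-training)
inner_loss = F.cross_entropy(x_in @ w0, y_in)
(g,) = torch.autograd.grad(inner_loss, w0, create_graph=True)
w1 = w0 - inner_lr * g                          # W1 = W0 - lr * grad L(W0)

# Outer loop (training time): loss evaluated at the *updated* weights W1
outer_loss = F.cross_entropy(x_out @ w1, y_out)
meta_opt.zero_grad()
outer_loss.backward()   # differentiates through g itself: second-order gradients
meta_opt.step()
```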

qTTT: query-only test-time training

The core reason for performance degradation in long contexts is score dilution in static self-attention. As the context length T grows, more distractors are introduced, which increases the softmax denominator and makes the attention mass on the needle (ground-truth evidence) tokens drop sharply. To reliably capture the needle, the target–distractor logit gap must grow as Ω(log T); see the derivation below.
The paper argues that approaches like generating more thinking tokens (e.g., CoT) have fundamental limitations in long contexts: with fixed parameters and attention, the intermediate tokens still cannot attend sufficiently to the needle because of the same dilution, so there is no information amplification. In other words, generation alone cannot fundamentally increase the margin.
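A back-of-the-envelope version of the dilution bound (my paraphrase of the argument, with s* the needle logit and s̄ a typical distractor logit):

```latex
% Attention mass on the needle under softmax, with T distractors:
\alpha_{\text{needle}}
  = \frac{e^{s^*}}{e^{s^*} + \sum_{i=1}^{T} e^{s_i}}
  \approx \frac{e^{s^*}}{e^{s^*} + T\, e^{\bar{s}}}
  = \frac{1}{1 + T\, e^{-(s^* - \bar{s})}}
% Keeping \alpha_{\text{needle}} bounded below therefore requires the
% logit gap to grow with the context: \; s^* - \bar{s} = \Omega(\log T).
```

qTTT raises this margin directly at test time: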
  • Prefill the long input once to fix K/V cache at each layer
  • Train for just a few steps on randomly sampled short spans (length k ≪ T) with the next-token loss
  • Update only W_Q (the query projection) and freeze everything else → no KV-cache recomputation, low cost
This update mathematically moves the query toward the needle key, increasing the margin (Proposition 3.1 in the paper).
In other words, to put more attention on the information (needle) needed in a long context, only the query side is lightly trained to shift its direction; a minimal sketch follows below. Note that the supervision signal here is not the ground-truth answer but self-supervised next-token prediction.
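A hedged sketch of the recipe above, again assuming HF-style conventions (`q_proj` module names, `.logits` output) that are illustrative rather than the paper's code. For simplicity this recomputes the forward pass over the prefix each step; because W_K and W_V are frozen, a real implementation would reuse the fixed prefilled KV cache instead.

```python
import random
import torch
import torch.nn.functional as F

def qttt(model, context_ids, num_steps=8, span_len=256, lr=1e-4):
    # Freeze all parameters except the query projections W_Q.
    for name, p in model.named_parameters():
        p.requires_grad_("q_proj" in name)
    q_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(q_params, lr=lr)

    T = context_ids.size(1)
    for _ in range(num_steps):
        # Randomly sampled short span (span_len << T), next-token supervision.
        start = random.randrange(0, T - span_len - 1)
        inputs = context_ids[:, : start + span_len]
        targets = context_ids[:, start + 1 : start + span_len + 1]
        # Logits at positions [start, start+span_len) predict the span tokens.
        logits = model(inputs).logits[:, start : start + span_len]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # keys/values stay fixed, so the KV cache remains valid
    return model
```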
 
 
