TTT-E2E: Test-Time Training, End-to-End
Instead of carrying the entire long context in the KV cache, the model reads the context with sliding-window attention (SWA) while continuously updating (compressing) part of its weights via next-token prediction. At test time, the context is thus compressed into the model weights through next-token prediction learning; only the MLP is updated via TTT, for stability.
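A minimal sketch of this test-time inner loop, assuming an HF-style causal LM exposed as `model` with `.logits` outputs; the chunked processing here stands in for sliding-window attention, and names like `context_ids`, `chunk_size`, and `lr` are illustrative placeholders, not the paper's settings.

```python
# Hypothetical sketch of the TTT-E2E test-time inner loop: read the long context
# chunk by chunk and take a next-token-prediction gradient step per chunk,
# updating only MLP parameters so the rest of the model stays fixed.
import torch
import torch.nn.functional as F

def compress_context_into_weights(model, context_ids, chunk_size=2048, lr=1e-3):
    # Only MLP parameters are trainable; attention and embeddings stay frozen.
    mlp_params = [p for n, p in model.named_parameters() if "mlp" in n]
    for p in model.parameters():
        p.requires_grad_(False)
    for p in mlp_params:
        p.requires_grad_(True)
    opt = torch.optim.SGD(mlp_params, lr=lr)

    # Read the context in local chunks (a stand-in for SWA), so no full-length
    # KV cache over the whole context is ever kept.
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start : start + chunk_size + 1]
        logits = model(chunk[:, :-1]).logits          # predict next tokens inside the chunk
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), chunk[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()                                    # context is "written" into MLP weights
    return model
```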
At 128K context it is 2.7× faster than full attention, and at 2M context 35× faster. Training uses an outer loop (training time) wrapped around an inner loop (test time). The outer-loop loss has a W₀ → gradient update → W₁ → loss structure, so the meta-learning stage is slow: it involves 2nd-order gradients, because the result of the inner gradient computation is differentiated again.
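A runnable toy illustration (not the paper's code) of that W₀ → gradient update → W₁ → loss structure, using a tiny MLP and random data to show where the second-order gradient comes from; the model, loss, and batch shapes are arbitrary placeholders.

```python
# Outer-loop (meta-learning) step: the inner update W0 -> W1 is kept in the
# autograd graph, so backpropagating the outer loss differentiates through the
# inner gradient, i.e. a second-order derivative with respect to W0.
import torch
import torch.nn.functional as F
from torch.func import functional_call

model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 1e-2

def loss_fn(params, x, y):
    # Evaluate the model under an arbitrary parameter set (W0 or W1).
    return F.mse_loss(functional_call(model, params, (x,)), y)

x_in, y_in = torch.randn(16, 8), torch.randn(16, 8)     # inner-loop (test-time-style) batch
x_out, y_out = torch.randn(16, 8), torch.randn(16, 8)   # outer-loop (meta) batch

w0 = dict(model.named_parameters())
inner_loss = loss_fn(w0, x_in, y_in)
# create_graph=True keeps the graph of the inner gradient, enabling 2nd-order backprop.
grads = torch.autograd.grad(inner_loss, list(w0.values()), create_graph=True)
w1 = {n: p - inner_lr * g for (n, p), g in zip(w0.items(), grads)}

outer_loss = loss_fn(w1, x_out, y_out)   # loss measured after the simulated inner update
meta_opt.zero_grad()
outer_loss.backward()                    # differentiates through `grads` (higher-order derivative)
meta_opt.step()
```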
Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time | NVIDIA Technical Blog
https://developer.nvidia.com/blog/reimagining-llm-memory-using-context-as-training-data-unlocks-models-that-learn-at-test-time/?ncid=so-twit-111373-vt37&linkId=100000402242985

qTTT: query-only test-time training
The core reason for performance degradation in long contexts is score dilution in static self-attention; the paper argues that approaches like generating more thinking tokens (e.g., CoT) have significant limitations here. Score dilution occurs because as the context length T grows, more distractors enter the softmax denominator, so the attention mass on the needle (ground-truth evidence) tokens drops sharply. To reliably capture the needle, the target–distractor logit gap must grow on the order of log T.

Why thinking tokens don't work well: simply generating more tokens with fixed parameters/attention doesn't help if the intermediate tokens still can't attend sufficiently to the needle (due to dilution), so there is no information amplification; "generation" alone cannot fundamentally increase the margin.
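A quick numeric illustration of the dilution argument (my own arithmetic, not the paper's numbers): with a fixed logit gap, the needle's softmax mass collapses as T grows, and keeping roughly half the mass on the needle requires a gap of about log T.

```python
# Score dilution: softmax over 1 needle key plus (T - 1) distractor keys that all
# share the same logit, with the needle's logit higher by a fixed `gap`.
import math

gap = 5.0  # needle logit minus distractor logit, held fixed
for T in [1_000, 10_000, 100_000, 1_000_000]:
    needle_mass = math.exp(gap) / (math.exp(gap) + (T - 1))  # attention mass on the needle
    required_gap = math.log(T - 1)  # gap needed for the needle to get ~50% of the mass
    print(f"T={T:>9,}  needle attention ≈ {needle_mass:.4f}  gap for ~0.5 mass ≈ {required_gap:.1f}")
```

qTTT counteracts this by adapting the query at test time: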
- Prefill the long input once to fix K/V cache at each layer
- Train for just a few steps on randomly sampled short spans (k≪T) using next-token loss
- Update only W_Q (query projection), freeze everything else → no KV cache recomputation, low cost
This update mathematically moves the query toward the needle key, increasing the margin (Proposition 3.1 in the paper).
In other words, to give more attention to the information (the needle) needed in long contexts, only the query side is lightly trained to shift its direction. Note that the supervision signal here is not a ground-truth answer but self-supervised next-token prediction.
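A hedged sketch of how such a query-only adaptation could look, assuming an HF-style causal LM whose query projections contain `q_proj` in their parameter names; `k`, `steps`, and `lr` are placeholder hyperparameters, and this simplified version trains on standalone spans rather than reusing the prefilled KV cache.

```python
# Hypothetical qTTT-style adaptation: a few self-supervised next-token steps on
# random short spans (k << T) of the long input, updating only query projections.
import torch
import torch.nn.functional as F

def qttt_adapt(model, long_input_ids, k=256, steps=8, lr=1e-4):
    # Train only the query projections; K/V paths stay frozen, so the keys and
    # values prefilled from the long input never need to be recomputed.
    q_params = [p for n, p in model.named_parameters() if "q_proj" in n]
    for p in model.parameters():
        p.requires_grad_(False)
    for p in q_params:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(q_params, lr=lr)

    T = long_input_ids.size(1)  # assumes T >> k
    for _ in range(steps):
        start = torch.randint(0, T - k - 1, (1,)).item()
        span = long_input_ids[:, start : start + k + 1]   # random short span of the input
        logits = model(span[:, :-1]).logits
        loss = F.cross_entropy(                           # self-supervised next-token loss
            logits.reshape(-1, logits.size(-1)), span[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Because the K/V projections are frozen, the prefilled keys and values for the long input stay valid, which is what keeps the adaptation cheap.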

Seonglae Cho