This sequence modeling framework explains sequence-to-sequence models, including the various Transformer variants, from the perspective of associative learning rather than recurrence. Building on associative memory, it defines memorization as a weighted regression problem over key-value pairs, and retrieval as applying the learned regression function to a query to read out the associated value.
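To make the memorize-then-retrieve view concrete, here is a minimal sketch assuming a ridge-regularized weighted linear least-squares memory in NumPy; the function and variable names (`memorize`, `retrieve`, `ridge`) are illustrative, not taken from the framework itself. Memorization fits a linear map W from keys to values, and retrieval applies W to a query.

```python
import numpy as np

def memorize(keys, values, weights=None, ridge=1e-6):
    """Fit a linear map W sending each key k_i to its value v_i by
    minimizing sum_i w_i * ||v_i - W k_i||^2 (weighted least squares)."""
    T, d_k = keys.shape
    w = np.ones(T) if weights is None else weights
    Kw = keys * w[:, None]
    A = keys.T @ Kw + ridge * np.eye(d_k)   # (d_k, d_k) normal-equation matrix
    B = values.T @ Kw                       # (d_v, d_k)
    return B @ np.linalg.inv(A)             # W: (d_v, d_k)

def retrieve(W, query):
    """Apply the learned regression function to a query to read out a value."""
    return W @ query

# Toy usage: store three key-value pairs, then query with a stored key.
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))   # keys
V = rng.normal(size=(3, 2))   # values
W = memorize(K, V)
print(retrieve(W, K[0]))      # approximately reproduces V[0]
```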
It recovers existing techniques such as linear attention, state space models, fast-weight programmers, online learners, and softmax attention as special cases, while also providing a mathematical grounding for query-key normalization and opening up new designs such as local polynomial attention. In particular, when the memory is fit by linear least squares, the fit can be maintained in a recursive form via Woodbury or LMS updates, so the model runs like an RNN; yet softmax attention and other non-recurrent models are explained by the same principle, with retrieval performed as nonparametric kernel regression over the stored pairs rather than through a recurrent state.
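The recursive form can be sketched as follows, assuming the rank-one Woodbury (Sherman-Morrison) update, i.e. recursive least squares; the class name, `lam` regularizer, and toy data are illustrative assumptions. The pair (W, P) acts as a recurrent state updated once per token, and W always equals the ridge least-squares solution over everything seen so far.

```python
import numpy as np

class RecursiveLeastSquaresMemory:
    """Recurrent-style associative memory: state (W, P) is updated per
    key-value pair via Sherman-Morrison, keeping W equal to the ridge
    least-squares fit of all pairs seen so far."""

    def __init__(self, d_k, d_v, lam=1e-2):
        self.W = np.zeros((d_v, d_k))    # current regression map
        self.P = np.eye(d_k) / lam       # P = (sum_i k_i k_i^T + lam*I)^{-1}

    def update(self, k, v):
        # Rank-one Woodbury update of the inverse covariance.
        Pk = self.P @ k
        gain = Pk / (1.0 + k @ Pk)       # Kalman-style gain vector
        self.P -= np.outer(gain, Pk)
        # Correct W toward the new pair using the prediction error.
        err = v - self.W @ k
        self.W += np.outer(err, gain)

    def query(self, q):
        return self.W @ q

# Toy usage: stream key-value pairs, then query, like an RNN reading a sequence.
rng = np.random.default_rng(1)
mem = RecursiveLeastSquaresMemory(d_k=4, d_v=2)
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 2))
for k, v in zip(K, V):
    mem.update(k, v)
print(mem.query(K[0]))
```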
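Softmax attention fits the same template but without a recurrent state: retrieval is a Nadaraya-Watson kernel regression estimate over all stored pairs with an exponential kernel, which is exactly a softmax-weighted average of the values. A minimal sketch in NumPy, with the usual 1/sqrt(d_k) scaling and illustrative names:

```python
import numpy as np

def softmax_attention_retrieve(keys, values, query):
    """Nadaraya-Watson kernel regression with an exponential kernel:
    the estimate at `query` is a softmax-weighted average of stored
    values, i.e. single-query softmax attention."""
    d_k = keys.shape[1]
    scores = keys @ query / np.sqrt(d_k)   # kernel similarities k(q, k_i)
    w = np.exp(scores - scores.max())      # numerically stable exponentials
    w /= w.sum()                           # normalize -> attention weights
    return w @ values                      # weighted average of values

rng = np.random.default_rng(2)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
print(softmax_attention_retrieve(K, V, K[0]))  # weight concentrates on V[0]
```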