This sequence modeling framework explains sequence-to-sequence models, including the various Transformer variants, from the perspective of associative learning rather than recurrence. Building on associative memory, it defines memorization as a weighted regression problem over key-value pairs, and retrieval as applying the learned regression function to a query to read out the associated value.
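To make the memorize-then-retrieve view concrete, here is a minimal sketch assuming a ridge-regularized weighted linear least-squares memory in NumPy; the function and variable names (`memorize`, `retrieve`, `ridge`) are illustrative, not taken from the framework itself. Memorization fits a linear map W from keys to values, and retrieval applies W to a query.

```python
import numpy as np

def memorize(keys, values, weights=None, ridge=1e-6):
    """Fit a linear map W sending each key k_i to its value v_i by
    minimizing sum_i w_i * ||v_i - W k_i||^2 (weighted least squares)."""
    T, d_k = keys.shape
    w = np.ones(T) if weights is None else weights
    Kw = keys * w[:, None]
    A = keys.T @ Kw + ridge * np.eye(d_k)   # (d_k, d_k) normal-equation matrix
    B = values.T @ Kw                       # (d_v, d_k)
    return B @ np.linalg.inv(A)             # W: (d_v, d_k)

def retrieve(W, query):
    """Apply the learned regression function to a query to read out a value."""
    return W @ query

# Toy usage: store three key-value pairs, then query with a stored key.
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))   # keys
V = rng.normal(size=(3, 2))   # values
W = memorize(K, V)
print(retrieve(W, K[0]))      # approximately reproduces V[0]
```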
It recovers existing techniques such as linear attention, state space models, fast-weight programmers, online learners, and softmax attention as special cases, while also providing a mathematical grounding for query-key normalization and opening up new designs such as local polynomial attention. In particular, when the memory is fit by linear least squares, the fit can be maintained in a recursive form via Woodbury or LMS updates, so the model runs like an RNN; yet softmax attention and other non-recurrent models are explained by the same principle, with retrieval performed as nonparametric kernel regression over the stored pairs rather than through a recurrent state.
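The recursive form can be sketched as follows, assuming the rank-one Woodbury (Sherman-Morrison) update, i.e. recursive least squares; the class name, `lam` regularizer, and toy data are illustrative assumptions. The pair (W, P) acts as a recurrent state updated once per token, and W always equals the ridge least-squares solution over everything seen so far.

```python
import numpy as np

class RecursiveLeastSquaresMemory:
    """Recurrent-style associative memory: state (W, P) is updated per
    key-value pair via Sherman-Morrison, keeping W equal to the ridge
    least-squares fit of all pairs seen so far."""

    def __init__(self, d_k, d_v, lam=1e-2):
        self.W = np.zeros((d_v, d_k))    # current regression map
        self.P = np.eye(d_k) / lam       # P = (sum_i k_i k_i^T + lam*I)^{-1}

    def update(self, k, v):
        # Rank-one Woodbury update of the inverse covariance.
        Pk = self.P @ k
        gain = Pk / (1.0 + k @ Pk)       # Kalman-style gain vector
        self.P -= np.outer(gain, Pk)
        # Correct W toward the new pair using the prediction error.
        err = v - self.W @ k
        self.W += np.outer(err, gain)

    def query(self, q):
        return self.W @ q

# Toy usage: stream key-value pairs, then query, like an RNN reading a sequence.
rng = np.random.default_rng(1)
mem = RecursiveLeastSquaresMemory(d_k=4, d_v=2)
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 2))
for k, v in zip(K, V):
    mem.update(k, v)
print(mem.query(K[0]))
```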
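Softmax attention fits the same template but without a recurrent state: retrieval is a Nadaraya-Watson kernel regression estimate over all stored pairs with an exponential kernel, which is exactly a softmax-weighted average of the values. A minimal sketch in NumPy, with the usual 1/sqrt(d_k) scaling and illustrative names:

```python
import numpy as np

def softmax_attention_retrieve(keys, values, query):
    """Nadaraya-Watson kernel regression with an exponential kernel:
    the estimate at `query` is a softmax-weighted average of stored
    values, i.e. single-query softmax attention."""
    d_k = keys.shape[1]
    scores = keys @ query / np.sqrt(d_k)   # kernel similarities k(q, k_i)
    w = np.exp(scores - scores.max())      # numerically stable exponentials
    w /= w.sum()                           # normalize -> attention weights
    return w @ values                      # weighted average of values

rng = np.random.default_rng(2)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
print(softmax_attention_retrieve(K, V, K[0]))  # weight concentrates on V[0]
```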