Learning to Memorize at Test Time
They focus on a long-term associative memory in which the goal is to store past data as pairs of keys and values. As in Transformers, they use two linear layers to project the input x_t into a key and a value: k_t = x_t W_K and v_t = x_t W_V.
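As a minimal sketch of the projection step (sizes and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_k = 8, 4                      # illustrative dimensions (assumptions)
x_t = rng.normal(size=(d_in,))        # current input token
W_K = rng.normal(size=(d_in, d_k))    # learned key projection
W_V = rng.normal(size=(d_in, d_k))    # learned value projection

# Linear projections, as in Transformer attention:
k_t = x_t @ W_K
v_t = x_t @ W_V
```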
Next, the memory module M is expected to learn the associations between keys and values. The loss is the MSE between the values and the values the memory reconstructs from the keys: loss(M; x_t) = || M(k_t) − v_t ||^2.
Accordingly, in the inner loop we optimize M's weights (at test time, by taking gradient steps on this loss), while in the outer loop we optimize the remaining parameters of the architecture (e.g., W_K, W_V, and the gate parameters).
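The inner-loop update can be sketched for the simplest case, a memory that is a single linear map M(k) = k @ M; the learning rate and sizes below are hypothetical, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4                               # illustrative size (assumption)
M = np.zeros((d_k, d_k))              # linear memory: M(k) = k @ M
k_t = rng.normal(size=(d_k,))
v_t = rng.normal(size=(d_k,))
lr = 0.1                              # inner-loop step size (hypothetical)

def mem_loss(M):
    # MSE between the value and the memory's reconstruction from the key
    err = k_t @ M - v_t
    return float(err @ err)

loss_before = mem_loss(M)
grad = 2.0 * np.outer(k_t, k_t @ M - v_t)  # analytic gradient of the loss w.r.t. M
M = M - lr * grad                          # one test-time (inner-loop) gradient step
loss_after = mem_loss(M)
```

A single step already reduces the reconstruction error on this key-value pair; the outer loop would meanwhile train W_K and W_V by backpropagating through such steps.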
Adaptive Forgetting
When dealing with very long sequences, it is crucial to manage which past information should be forgotten. They add an adaptive forgetting mechanism that allows the memory to drop information that is no longer needed, making better use of the memory's limited capacity:

M_t = (1 − α_t) M_{t−1} + S_t,

where α_t in [0, 1] is a gating mechanism that flexibly controls the memory (α_t near 1 erases, α_t near 0 retains) and S_t is the surprise-based update.
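A sketch of one gated update for the linear-memory case, with fixed scalar gates for illustration (in the paper the gates are data-dependent; the specific values below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = 4
M_prev = rng.normal(size=(d_k, d_k))   # memory after the previous token
S_prev = np.zeros((d_k, d_k))          # momentum ("surprise") state
k_t = rng.normal(size=(d_k,))
v_t = rng.normal(size=(d_k,))

# Hypothetical gate values: forgetting, momentum decay, step size
alpha_t, eta_t, theta_t = 0.1, 0.9, 0.05

grad = 2.0 * np.outer(k_t, k_t @ M_prev - v_t)  # gradient of ||M(k_t) - v_t||^2
S_t = eta_t * S_prev - theta_t * grad           # momentum-smoothed surprise
M_t = (1.0 - alpha_t) * M_prev + S_t            # adaptive forgetting via alpha_t
```

Setting alpha_t close to 1 wipes most of the old memory before writing the new surprise, while alpha_t near 0 accumulates.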

Memory as a Context

The sequence is split into segments; for each segment, relevant history is retrieved from the long-term memory and prepended to the segment as extra context for an attention block.

Memory as a Gate

The input flows through a sliding-window attention branch and the long-term memory branch in parallel, and their outputs are combined with a gating mechanism.

Memory as a Layer

The memory module is stacked as a layer of the network, compressing the sequence before it is passed to attention.

LMM (long-term memory only model)

A variant that drops attention entirely and uses only the neural long-term memory module.

On the needle-in-a-haystack (NIAH) task, the LMM variant performs best, since it devotes all of its capacity to long-term memorization.
