Surprise Score


Data that violates expectations is more memorable for humans.

Inspired by this, a simple definition of surprise for a model is its gradient with respect to the input: the larger the gradient, the more the input data differs from the past data.
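In update form, the memory takes an online gradient step driven by this momentary surprise (a sketch in the paper's notation, with $\theta_t$ a data-dependent learning rate and $\ell$ the associative loss defined below):

$$
M_t = M_{t-1} - \theta_t\, \nabla \ell(M_{t-1};\, x_t)
$$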
The key idea for training a long-term memory is to treat its training as an online learning problem, in which we aim to compress the past information into the parameters of the long-term neural memory module.
This surprise metric, however, can miss important information that comes right after a big surprising moment. To address this, the surprise is accumulated with momentum, splitting it into a past and a momentary component:

$$
S_t = \eta_t\, S_{t-1} - \theta_t\, \nabla \ell(M_{t-1};\, x_t), \qquad M_t = M_{t-1} + S_t
$$

where $\eta_t$ is a data-dependent surprise decay and the term $\theta_t$ controls how much of the momentary surprise is incorporated into the final surprise metric in a data-dependent manner.
They focus on long-term Associative Memory, in which we aim to store the past data as pairs of keys and values. Similar to Transformers, they use two linear layers to project the input into a key and a value.
Next, they expect the memory module to learn the associations between keys and values, so they define the loss as the MSE between the values and the values reconstructed from the keys.
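As a sketch in the usual notation (with $W_K$, $W_V$ the two projections and $M_{t-1}$ the memory network):

$$
k_t = x_t W_K, \qquad v_t = x_t W_V, \qquad \ell(M_{t-1};\, x_t) = \lVert M_{t-1}(k_t) - v_t \rVert_2^2
$$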
Accordingly, in the inner loop, we optimize the memory module $M$'s weights, while in the outer loop, we optimize the other parameters of the entire architecture (e.g., the projections $W_K$ and $W_V$).

Adaptive Forgetting

When dealing with very long sequences, it is crucial to manage which past information should be forgotten. They add an adaptive forgetting mechanism that allows the memory to forget information that is no longer needed, resulting in better management of the memory's limited capacity.
$$
M_t = (1 - \alpha_t)\, M_{t-1} + S_t
$$

where $\alpha_t \in [0, 1]$ is the gating mechanism that flexibly controls the memory.
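A minimal runnable sketch of the full update loop, assuming a single linear memory and scalar gates (the paper uses a deeper MLP memory and learns the gates; `gates()` here is a hypothetical stand-in):

```python
import torch

torch.manual_seed(0)
d = 16                              # model dimension (assumed)
W_K = torch.randn(d, d) / d**0.5    # outer-loop key projection
W_V = torch.randn(d, d) / d**0.5    # outer-loop value projection
M = torch.zeros(d, d)               # linear memory: v_hat = k @ M
S = torch.zeros(d, d)               # accumulated surprise (gradient momentum)

def gates(x):
    # Stand-in for the learned, data-dependent gates: eta_t (surprise decay),
    # theta_t (momentary-surprise rate), alpha_t (forgetting gate).
    return 0.9, 0.1, 0.01

for t in range(8):
    x = torch.randn(d)
    k, v = x @ W_K, x @ W_V         # project token into key / value
    eta, theta, alpha = gates(x)

    # Momentary surprise: gradient of the MSE loss ||k @ M - v||^2 w.r.t. M,
    # which for a linear memory is 2 * outer(k, k @ M - v).
    grad = 2.0 * torch.outer(k, k @ M - v)

    # Past surprise decays by eta; momentary surprise is mixed in by theta.
    S = eta * S - theta * grad

    # Adaptive forgetting: alpha gates how much of the old memory is erased.
    M = (1 - alpha) * M + S
```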
 
 

How can one retrieve information from the memory?

Retrieval is a simple forward pass with the weights frozen: project the input into a query $q_t = x_t W_Q$ and read out $y_t = M^*(q_t)$, where $M^*$ denotes the memory without any weight update.
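For the linear memory sketched above, this read reduces to a single matrix product (`W_Q` and the untrained `M` here are illustrative assumptions):

```python
import torch

d = 16
M = torch.zeros(d, d)               # memory produced by the update loop above
W_Q = torch.randn(d, d) / d**0.5    # outer-loop query projection (assumed)
x = torch.randn(d)
q = x @ W_Q                         # q_t = x_t W_Q
y = q @ M                           # y_t = M*(q_t): read without updating
```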
