Transformer Memory Layer

Creator
Seonglae Cho
Created
2024 Dec 8 15:22
Edited
2025 Mar 16 17:33
Refs

Abstract

A trainable key-value lookup mechanism that adds extra parameters to a model without increasing FLOPs. Sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply.

Implementation

Essentially, the dense FFN layer that processes the attention output is replaced with a memory layer. The attention output serves as a query that is compared against the memory keys to select the top-k most similar ones, and the corresponding values are combined by a weighted sum. The combined value is gated with a SiLU non-linearity before being passed to the next layer.
Memory keys and values are trainable, and only the top-k entries are activated per query. Product-key lookup splits the query and keys into two halves, so the search costs roughly 2·√N comparisons instead of N, and parallelized GPU kernels keep the sparse lookup efficient. The Memory+ variant further improves stability and performance by adding input-dependent gating with the silu non-linearity.
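A minimal PyTorch sketch of this flow, assuming a product-key memory with silu gating; the class name ProductKeyMemory, the dimensions, the initialization, and the simple candidate re-ranking are illustrative choices, not the paper's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim=64, n_keys=128, topk=4):
        super().__init__()
        half = dim // 2
        self.n_keys, self.topk = n_keys, topk
        # Two half-dimension key tables; their Cartesian product indexes n_keys**2 memory slots.
        self.keys1 = nn.Parameter(torch.randn(n_keys, half) * half ** -0.5)
        self.keys2 = nn.Parameter(torch.randn(n_keys, half) * half ** -0.5)
        # Trainable values, one per product slot; EmbeddingBag performs the weighted sum.
        self.values = nn.EmbeddingBag(n_keys * n_keys, dim, mode="sum")
        # Input-dependent gating with silu non-linearity (the Memory+ addition).
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (batch, dim) attention output used as query
        q1, q2 = x.chunk(2, dim=-1)                    # split the query into two halves
        s1, s2 = q1 @ self.keys1.T, q2 @ self.keys2.T  # (batch, n_keys) scores per half
        v1, i1 = s1.topk(self.topk, dim=-1)            # top-k per half: ~2*sqrt(N) comparisons, not N
        v2, i2 = s2.topk(self.topk, dim=-1)
        # Merge the two candidate sets (topk*topk pairs) and keep the overall top-k slots.
        scores = (v1[:, :, None] + v2[:, None, :]).flatten(1)
        slots = (i1[:, :, None] * self.n_keys + i2[:, None, :]).flatten(1)
        best, pos = scores.topk(self.topk, dim=-1)
        chosen = slots.gather(1, pos)
        weights = F.softmax(best, dim=-1)
        y = self.values(chosen, per_sample_weights=weights)   # weighted sum of retrieved values
        return self.proj(y * F.silu(self.gate(x)))            # silu gating, then output projection

# Usage: drop-in replacement for an FFN block operating on attention outputs.
mem = ProductKeyMemory(dim=64)
out = mem(torch.randn(8, 64))                          # (8, 64)
```

Because each half-query is compared against only n_keys half-keys, the memory can hold n_keys² value vectors while the lookup cost stays small, which is what lets the parameter count grow without increasing FLOPs.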

Conclusion

It helps reduce AI Hallucination and demonstrates the utility of memory layers in AI architecture design. Consequently, the paper shows that the method can improve Question answering AI performance (factual QA), though it does not improve reasoning performance.
