Abstract
A trainable key-value lookup mechanism adds extra parameters to a model without increasing FLOPs. Sparsely activated memory layers complement the compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply.
Implementation
The existing FFN layer that processes the attention output is replaced with a memory layer. The attention output serves as the query for the memory layer: it is compared against the trainable keys to select the top-k most similar ones, and the corresponding values are combined by a weighted sum. The combined value passes through a SiLU activation and gating before being passed to the next layer.
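A minimal sketch of this forward pass in PyTorch, assuming a dense (non-product-key) lookup for clarity; the class name MemoryLayer and parameters num_keys and top_k are illustrative, not the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, dim, num_keys=1024, top_k=32):
        super().__init__()
        # trainable keys and values
        self.keys = nn.Parameter(torch.randn(num_keys, dim) / dim**0.5)
        self.values = nn.Parameter(torch.randn(num_keys, dim) / dim**0.5)
        self.gate_proj = nn.Linear(dim, dim, bias=False)  # gating branch
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq, dim) -- the attention output acts as the query
        scores = x @ self.keys.t()                          # similarity to all keys
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # weights over selected keys
        selected = self.values[topk_idx]                    # (batch, seq, top_k, dim)
        y = (weights.unsqueeze(-1) * selected).sum(dim=-2)  # weighted sum of values
        # SiLU-gated output before handing off to the next layer
        return y * F.silu(self.gate_proj(x))

x = torch.randn(2, 16, 64)
print(MemoryLayer(64)(x).shape)  # torch.Size([2, 16, 64])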
Memory keys and values are trainable, and only the top-k keys are activated per query. GPU-parallel lookup and a product-key decomposition keep the search computationally efficient. The Memory+ variant further improves stability and performance by adding gating with a SiLU nonlinearity.
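A hedged sketch of the product-key lookup idea: the full key set of size N = n * n is factored into two half-dimension sub-key tables of size n, so finding the top-k of N keys only requires two top-k searches over n candidates each. The function name and shapes below are assumptions for illustration:

import torch
import torch.nn.functional as F

def product_key_topk(query, sub_keys1, sub_keys2, top_k):
    # query: (dim,); each sub_keys*: (n, dim // 2)
    q1, q2 = query.chunk(2)                    # split the query in half
    s1, i1 = (sub_keys1 @ q1).topk(top_k)      # best half-keys on each side
    s2, i2 = (sub_keys2 @ q2).topk(top_k)
    # combine the two candidate sets: k*k candidate scores, then re-select top-k
    cand_scores = (s1[:, None] + s2[None, :]).flatten()
    best_scores, best = cand_scores.topk(top_k)
    n = sub_keys2.shape[0]
    full_idx = i1[best // top_k] * n + i2[best % top_k]  # index into the N = n*n keys
    return F.softmax(best_scores, dim=-1), full_idx

q = torch.randn(64)
k1, k2 = torch.randn(512, 32), torch.randn(512, 32)
w, idx = product_key_topk(q, k1, k2, top_k=8)
print(w.shape, idx.shape)  # torch.Size([8]) torch.Size([8])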
Conclusion
It helps reduce hallucination and demonstrates the utility of memory layers in architecture design. The paper shows the method improves question-answering performance, though it does not improve reasoning performance.