Abstract
A trainable key-value lookup mechanism adds extra parameters to a model without increasing FLOPs. Sparsely activated memory layers complement the compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply.
Implementation
The existing FFN layer that processes the attention output is replaced with a memory layer. The attention output serves as the query for the memory layer: it is compared against the trainable keys to select the top-k most similar ones, and the corresponding values are combined by a weighted sum. The combined value passes through a SiLU activation and gating before being passed to the next layer.
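A minimal sketch of this forward pass in PyTorch, assuming a dense (non-product-key) lookup for clarity; the class name MemoryLayer and parameters num_keys and top_k are illustrative, not the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, dim, num_keys=1024, top_k=32):
        super().__init__()
        # trainable keys and values
        self.keys = nn.Parameter(torch.randn(num_keys, dim) / dim**0.5)
        self.values = nn.Parameter(torch.randn(num_keys, dim) / dim**0.5)
        self.gate_proj = nn.Linear(dim, dim, bias=False)  # gating branch
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq, dim) -- the attention output acts as the query
        scores = x @ self.keys.t()                          # similarity to all keys
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # weights over selected keys
        selected = self.values[topk_idx]                    # (batch, seq, top_k, dim)
        y = (weights.unsqueeze(-1) * selected).sum(dim=-2)  # weighted sum of values
        # SiLU-gated output before handing off to the next layer
        return y * F.silu(self.gate_proj(x))

x = torch.randn(2, 16, 64)
print(MemoryLayer(64)(x).shape)  # torch.Size([2, 16, 64])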
Memory keys and values are trainable, and only the top-k keys are activated per query. GPU-parallel lookup and a product-key decomposition keep the search computationally efficient. The Memory+ variant further improves stability and performance by adding gating with a SiLU nonlinearity.
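A hedged sketch of the product-key lookup idea: the full key set of size N = n * n is factored into two half-dimension sub-key tables of size n, so finding the top-k of N keys only requires two top-k searches over n candidates each. The function name and shapes below are assumptions for illustration:

import torch
import torch.nn.functional as F

def product_key_topk(query, sub_keys1, sub_keys2, top_k):
    # query: (dim,); each sub_keys*: (n, dim // 2)
    q1, q2 = query.chunk(2)                    # split the query in half
    s1, i1 = (sub_keys1 @ q1).topk(top_k)      # best half-keys on each side
    s2, i2 = (sub_keys2 @ q2).topk(top_k)
    # combine the two candidate sets: k*k candidate scores, then re-select top-k
    cand_scores = (s1[:, None] + s2[None, :]).flatten()
    best_scores, best = cand_scores.topk(top_k)
    n = sub_keys2.shape[0]
    full_idx = i1[best // top_k] * n + i2[best % top_k]  # index into the N = n*n keys
    return F.softmax(best_scores, dim=-1), full_idx

q = torch.randn(64)
k1, k2 = torch.randn(512, 32), torch.randn(512, 32)
w, idx = product_key_topk(q, k1, k2, top_k=8)
print(w.shape, idx.shape)  # torch.Size([8]) torch.Size([8])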
Conclusion
It helps reduce hallucination and demonstrates the utility of memory layers in architecture design. The paper shows the method improves question-answering performance, though it does not improve reasoning performance.