Multi-head Latent Attention

Creator: Seonglae Cho
Created: 2025 Jan 27 15:36
Edited: 2025 Mar 3 17:22
Compresses the Key and Value vectors of input tokens into a shared low-dimensional latent vector. Only this latent vector needs to be stored per token, which shrinks the KV cache and reduces memory usage during inference; the full-dimensional Keys and Values are reconstructed from the latent when attention is computed.
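A minimal sketch of the idea in numpy (the dimensions and weight matrices below are illustrative placeholders, not DeepSeek's actual configuration): hidden states are down-projected to a small latent vector, which is what gets cached, and Keys/Values are up-projected from it on the fly.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 64, 16          # latent is much smaller than the model dim
n_heads, d_head = 4, 16

# Hypothetical projection matrices, randomly initialized for the sketch
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.1          # down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1  # latent -> Keys
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1  # latent -> Values

seq_len = 8
h = rng.standard_normal((seq_len, d_model))  # token hidden states

# Compress: only this latent goes into the KV cache
c_kv = h @ W_dkv                              # (seq_len, d_latent)

# Decompress when attention is computed
k = (c_kv @ W_uk).reshape(seq_len, n_heads, d_head)
v = (c_kv @ W_uv).reshape(seq_len, n_heads, d_head)

# Cache cost per token drops from storing full K and V
# (2 * n_heads * d_head floats) to d_latent floats
print(c_kv.shape, k.shape, v.shape)
```

Here the cache stores 16 floats per token instead of 128 (Keys plus Values across all heads), at the cost of an extra matrix multiply at decode time.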

FlashMLA

Efficient MLA decoding kernel for Hopper GPUs (optimized for variable-length sequences, battle-tested in production)
