Compresses the Key and Value vectors of input tokens into lower dimensions. This reduces the dimensions of Key and Value, decreasing KV cache size and optimizing memory usage.
FlashMLA
Efficient MLA Decoding Kernel for Hopper GPUs (Optimized for variable-length sequences, battle-tested in production)