Instead of selecting the top k activations for each individual sample, we select the top n × k activations across the entire batch of n samples. (BatchTopK: https://arxiv.org/pdf/2412.06410, https://openreview.net/pdf?id=9ca9eHNrdH)
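The difference between the two selection rules can be illustrated with a minimal pure-Python sketch. The function names `per_sample_top_k` and `batch_top_k` are hypothetical and chosen for illustration, activations are assumed to be a list of n equal-length rows, and ties are broken arbitrarily by index; this is not the authors' implementation.

```python
import heapq


def per_sample_top_k(acts, k):
    """Keep the k largest activations in each row; zero out the rest."""
    out = []
    for row in acts:
        kept = set(heapq.nlargest(k, range(len(row)), key=row.__getitem__))
        out.append([v if j in kept else 0.0 for j, v in enumerate(row)])
    return out


def batch_top_k(acts, k):
    """Keep the n * k largest activations across the whole batch of n rows."""
    n = len(acts)
    # Flatten to (value, row index, column index) so positions survive sorting.
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    keep = heapq.nlargest(n * k, flat)
    out = [[0.0] * len(row) for row in acts]
    for v, i, j in keep:
        out[i][j] = v
    return out


# Hypothetical batch of n = 2 samples with 3 activations each, k = 1.
acts = [[5.0, 4.0, 3.0],
        [0.1, 2.0, 0.2]]
per_sample = per_sample_top_k(acts, 1)
batched = batch_top_k(acts, 1)
```

In this toy batch the per-sample rule keeps one activation in each row, while the batch-level rule spends both of its n × k = 2 slots on the first sample, so the number of active entries per sample can vary even though the batch average stays at k.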