Instead of selecting the top k activations for each individual sample, we select the top n × k activations across the entire batch of n samples.
BatchTopK
arxiv.org
https://arxiv.org/pdf/2412.06410
openreview.net
https://openreview.net/pdf?id=9ca9eHNrdH
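
The batch-level selection described in the quote at the top of this note can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `batch_topk` and `per_sample_topk`, the plain-list representation, and the toy values are all assumptions made for illustration, and the inputs are assumed to be non-negative activations.

```python
import heapq


def batch_topk(acts, k):
    """Keep the n * k largest activations across the whole batch, zero the rest.

    acts: list of n rows (one per sample), each a list of floats.
    k: target *average* number of active latents per sample.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    n = len(acts)
    # Flatten to (value, row, col) so the selection is batch-wide, not per row.
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    keep = heapq.nlargest(n * k, flat, key=lambda t: t[0])
    out = [[0.0] * len(row) for row in acts]
    for v, i, j in keep:
        out[i][j] = v
    return out


def per_sample_topk(acts, k):
    """Baseline for comparison: keep exactly k activations in every row."""
    out = []
    for row in acts:
        idx = set(heapq.nlargest(k, range(len(row)), key=row.__getitem__))
        out.append([v if j in idx else 0.0 for j, v in enumerate(row)])
    return out


# Toy batch (made-up values): n = 2 samples, 4 latents each.
acts = [
    [0.9, 0.8, 0.7, 0.1],
    [0.2, 0.1, 0.05, 0.0],
]

sparse = batch_topk(acts, k=2)
print(sparse)                     # [[0.9, 0.8, 0.7, 0.0], [0.2, 0.0, 0.0, 0.0]]
print(per_sample_topk(acts, 2))   # [[0.9, 0.8, 0.0, 0.0], [0.2, 0.1, 0.0, 0.0]]
```

In this toy batch the total number of kept activations is still n × k = 4, but the first sample keeps three and the second keeps one, whereas the per-sample baseline forces exactly two in each row.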

Seonglae Cho