Only a handful of values in a large language model, at most six individual weights plus one associated activation, have an outsized impact on model quality. Removing these super weights destroys the model's ability to generate coherent text: zero-shot accuracy drops to chance levels and perplexity surges.
The paper detects super weights efficiently by inspecting activation distributions (the inputs and outputs of each layer's down-projection) over a single forward pass on one input prompt.
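As a rough illustration of this detection idea, the sketch below hooks every down-projection in a Llama-style Hugging Face checkpoint and records activation spikes from one prompt. The checkpoint name, module paths (`model.model.layers[i].mlp.down_proj`), and the spike heuristic are assumptions for illustration, not the paper's exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

records = []  # (layer index, max |input|, max |output|) for each down_proj

def make_hook(layer_idx):
    def hook(module, inputs, output):
        x = inputs[0].detach()    # down_proj input activations
        y = output.detach()       # down_proj output activations
        records.append((layer_idx, x.abs().max().item(), y.abs().max().item()))
    return hook

handles = [
    layer.mlp.down_proj.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)  # a single prompt is enough to expose the spikes

for h in handles:
    h.remove()

# A layer whose down_proj output spikes far above its input is a candidate
# for holding a super weight, located at (output spike row, input spike column).
for layer_idx, in_max, out_max in records:
    print(f"layer {layer_idx}: max|input|={in_max:.1f}  max|output|={out_max:.1f}")
```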
Super activation
A super weight produces an exceptionally large activation value (the super activation) at a specific channel and token position, and this value is carried through the residual stream to all subsequent layers.
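A rough way to see this propagation, reusing `model`, `tok`, and `batch` from the detection sketch above, is to request all hidden states and track where the largest-magnitude value sits after each layer; in models with a super weight, one channel jumps to an extreme value and stays there through the residual stream.

```python
import torch

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states[0] is the embedding output; entry i > 0 follows decoder layer i.
for i, h in enumerate(out.hidden_states):
    flat = h[0].abs()                           # shape: (seq_len, hidden_dim)
    val, idx = flat.max(), flat.argmax()
    tok_pos, channel = divmod(idx.item(), flat.shape[-1])
    label = "embeddings" if i == 0 else f"after layer {i}"
    print(f"{label}: max|h|={val.item():.1f} at token {tok_pos}, channel {channel}")
```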
Application
By preserving the super weights and handling the super activation at higher precision, even simple round-to-nearest quantization maintains high quality. Preserving super weights achieves performance comparable to more sophisticated quantization techniques such as SmoothQuant, with the added advantage of requiring no calibration data.
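Below is a minimal sketch of round-to-nearest weight quantization that holds the super weight out of the quantization grid. The per-tensor INT8 absmax scheme and the `(row, col)` coordinates (assumed to come from the detection step above) are illustrative choices, not the paper's exact recipe.

```python
import torch

def rtn_quantize_preserving(w: torch.Tensor, super_coords, n_bits: int = 8):
    """Round-to-nearest quantize `w`, restoring listed super weights at full precision."""
    saved = [(r, c, w[r, c].clone()) for r, c in super_coords]  # hold out super weights
    w_clipped = w.clone()
    for r, c, _ in saved:
        w_clipped[r, c] = 0.0                  # keep outliers from inflating the scale
    qmax = 2 ** (n_bits - 1) - 1
    scale = w_clipped.abs().max() / qmax       # absmax scale without the outliers
    q = torch.clamp(torch.round(w_clipped / scale), -qmax - 1, qmax)
    w_deq = q * scale                          # simulated dequantized weights
    for r, c, v in saved:
        w_deq[r, c] = v                        # restore super weights at original value
    return w_deq

# Hypothetical usage: layer_idx, row, col are coordinates found by detection.
# w = model.model.layers[layer_idx].mlp.down_proj.weight.data
# model.model.layers[layer_idx].mlp.down_proj.weight.data = rtn_quantize_preserving(w, [(row, col)])
```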