Artifacts appearing in ViT feature maps are high-L2-norm tokens that emerge in low-information background regions of images; the model repurposes these patches as scratch space for internal computation. Because such tokens carry global rather than local information, they hurt the model's interpretability and its performance on dense prediction.
Adding register tokens to the input sequence gives the model dedicated computational space of its own, which removes the artifacts and improves performance on downstream tasks such as object detection and dense prediction.
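A minimal sketch of how such artifact tokens can be spotted in practice: compute the L2 norm of each patch token and flag outliers, which tend to sit in low-information background regions. The threshold value here is an illustrative assumption, not a number from the papers.

```python
import torch

def find_artifact_tokens(patch_tokens: torch.Tensor, threshold: float = 100.0):
    """patch_tokens: (B, N, D) features from a pretrained ViT layer."""
    norms = patch_tokens.norm(dim=-1)   # (B, N) per-token L2 norms
    return norms > threshold            # boolean mask of artifact tokens
```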
Vision Transformers Need Registers
Inside the Transformer, token activation norms show that the high-norm tokens act as excessive attention sinks, much like the CLS token, distorting attention visualizations and degrading performance. Register tokens are therefore trained to serve as a kind of register that stores global image information, like a global memory.
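A minimal sketch, assuming a standard PyTorch ViT layout (the class, parameter names, and shapes are mine, not the paper's code), of how learnable register tokens are appended alongside the CLS token during training and then discarded at the output:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Extra learnable tokens that give the model scratch space for
        # global computation, so patch tokens stay locally informative.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.encoder = encoder          # any stack of Transformer blocks
        self.num_registers = num_registers

    def forward(self, patch_tokens: torch.Tensor):  # (B, N, dim) patch embeddings
        B = patch_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.registers.expand(B, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)
        x = self.encoder(x)
        # Registers are dropped at the output; only the CLS and patch
        # tokens are used for downstream tasks.
        return x[:, 0], x[:, 1 + self.num_registers:]
```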
Vision Transformers Don’t Need Trained Registers
This paper identifies register neurons in the MLP layers that generate the high-norm tokens, and intervenes directly by shifting those high-norm activations onto a newly appended token at an early layer. This reproduces the effect of trained registers in pretrained models without any additional training (a test-time register).
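A hedged sketch of the test-time register idea: append an extra token to a frozen, pretrained ViT and, at an early block, reroute the high-norm activation onto that token instead of a patch token. The layer index, the norm-based token selection, and the function shape below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def forward_with_test_time_register(tokens: torch.Tensor, blocks, intervene_at: int = 3):
    """tokens: (B, N, D) input embeddings of a frozen pretrained ViT."""
    B, N, D = tokens.shape
    register = torch.zeros(B, 1, D, device=tokens.device)
    x = torch.cat([tokens, register], dim=1)  # extra slot, no retraining
    for i, block in enumerate(blocks):
        x = block(x)
        if i == intervene_at:
            # Find the patch token with the largest L2 norm (the artifact).
            norms = x[:, :N].norm(dim=-1)     # (B, N)
            idx = norms.argmax(dim=1)         # (B,)
            batch = torch.arange(B, device=x.device)
            # Move its activation into the register slot and clear it, so
            # global computation happens in the register, not in a patch.
            x[batch, N] = x[batch, idx]
            x[batch, idx] = 0
    return x[:, :N], x[:, N]                  # patch tokens, test-time register
```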

Seonglae Cho