ViT
Used as Visual Encoder commonly
Image processing by dividing an image into fixed-size patches, treating each patch as a token. These patches are then linearly embedded, along with 2D position embeddings, to retain spatial information
Vision Transformers
VISION TRANSFORMERS NEED REGISTERS
Register tokens enable interpretable attention maps in all vision transformers Prompt Learning
Vision LSTM