ViT
Commonly used as a visual encoder
Processes images by dividing each image into fixed-size patches and treating each patch as a token. The patches are then linearly embedded, and learned position embeddings are added to retain spatial information
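The patch-embedding step above can be sketched in NumPy. This is a minimal illustration, not the real implementation: the sizes match ViT-Base/16 (224×224 input, 16×16 patches, 768-dim embeddings), and the projection matrix, [CLS] token, and position embeddings are random stand-ins for learned parameters.

```python
import numpy as np

# Hypothetical sizes matching ViT-Base/16: 224x224 image, 16x16 patches, 768-dim tokens
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

rng = np.random.default_rng(0)
image = rng.standard_normal((image_size, image_size, 3))

# Split the image into non-overlapping 16x16 patches, flatten each into a vector
grid = image_size // patch_size
patches = (
    image.reshape(grid, patch_size, grid, patch_size, 3)
    .transpose(0, 2, 1, 3, 4)
    .reshape(num_patches, patch_size * patch_size * 3)
)

# Linear projection of flattened patches (stand-in for the learned embedding matrix)
W = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
tokens = patches @ W  # (196, 768)

# Prepend a [CLS] token, then add position embeddings to retain spatial information
cls_token = np.zeros((1, embed_dim))
pos_embed = rng.standard_normal((num_patches + 1, embed_dim)) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0) + pos_embed

print(sequence.shape)  # (197, 768): 1 CLS + 196 patch tokens
```

The resulting (197, 768) sequence is what the transformer encoder blocks then attend over.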
Vision Transformers
ViTs are the de facto standard visual backbone
google/vit-base-patch16-224-in21k · Hugging Face
https://huggingface.co/google/vit-base-patch16-224-in21k
VISION TRANSFORMERS NEED REGISTERS
Register tokens enable interpretable attention maps in all vision transformers
Prompt Learning
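The registers idea is simple to sketch: append a few extra learnable tokens to the patch sequence, let every transformer block attend over them, and discard them at the output. The shapes below are hypothetical (197 = 1 CLS + 196 patch tokens for ViT-Base/16, 4 registers as in the paper), and the random arrays stand in for learned parameters.

```python
import numpy as np

# Hypothetical sequence from a ViT-Base/16 patch embedder: 1 CLS + 196 patch tokens
num_tokens, num_registers, embed_dim = 197, 4, 768
rng = np.random.default_rng(0)
sequence = rng.standard_normal((num_tokens, embed_dim))

# Learnable register tokens (random stand-in), appended after the patch tokens
registers = rng.standard_normal((num_registers, embed_dim)) * 0.02
sequence_with_regs = np.concatenate([sequence, registers], axis=0)  # (201, 768)

# ... transformer blocks attend over all 201 tokens here ...

# Registers are dropped before the output head; only CLS + patch tokens remain,
# leaving the patch-token attention maps free of high-norm outlier artifacts
output = sequence_with_regs[:num_tokens]
print(sequence_with_regs.shape, output.shape)  # (201, 768) (197, 768)
```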
The Transformer model family
https://huggingface.co/docs/transformers/model_summary
Paper page - ConvNets Match Vision Transformers at Scale
https://huggingface.co/papers/2310.16764
Vision LSTM
Vision-LSTM: xLSTM as Generic Vision Backbone
Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to xLSTM, which this work adapts as a generic vision backbone.
https://arxiv.org/abs/2406.04303


Seonglae Cho