Vision Transformer

ViT

Commonly used as a visual encoder

A ViT processes an image by dividing it into fixed-size patches and treating each patch as a token. The patches are flattened, linearly embedded, and combined with position embeddings to retain spatial information. The original ViT uses learnable 1D position embeddings; 2D-aware variants also exist.
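The patch-to-token step can be sketched in plain Python. This is a minimal illustration, not the reference implementation. The function names `patchify` and `embed`, the toy sizes, and the random weights are all hypothetical. A real ViT learns the projection and the position embeddings, and usually prepends a class token, which is omitted here.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch to a vector of length patch * patch * C.
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches

def embed(patches, weight, pos):
    """Linearly project each flattened patch and add its position embedding."""
    dim = len(pos[0])
    tokens = []
    for i, flat in enumerate(patches):
        projected = [
            sum(x * w_row[j] for x, w_row in zip(flat, weight))
            for j in range(dim)
        ]
        tokens.append([p + e for p, e in zip(projected, pos[i])])
    return tokens

rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 6  # hypothetical toy sizes: 8x8 RGB image, 4x4 patches, 6-dim tokens
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = patchify(image, P)
# Random stand-ins for parameters that would be learned in a real model.
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
pos = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(len(patches))]

tokens = embed(patches, weight, pos)
print(len(tokens), len(tokens[0]))  # 4 6
```

An 8x8 image with 4x4 patches yields (8/4) * (8/4) = 4 tokens. The resulting sequence is what a standard Transformer encoder would consume.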

Vision Transformers

ConvNeXT

V-MoEs

SeNaTra

Vision Transformer Notion

ViT Vision Encoder

Visual Autoregressive Transformer

VISION TRANSFORMERS NEED REGISTERS

Prompt Learning

Vision Transformers Need Registers

https://arxiv.org/pdf/2309.16588.pdf

The Transformer model family

https://huggingface.co/docs/transformers/model_summary

ConvNets Match Vision Transformers at Scale

https://huggingface.co/papers/2310.16764

Vision LSTM

Vision-LSTM: xLSTM as Generic Vision Backbone

Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to...

https://arxiv.org/abs/2406.04303
