ViT, Vision Transformer

August 23, 2025

An image model that splits an image into fixed-size patches and processes them as a sequence of tokens with a transformer architecture. It scales well with data and compute and is often used in multimodal and perception stacks.
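To make "patch tokens" concrete, here is a minimal sketch in plain Python of the patch-splitting step only. The names `patchify` and `make_test_image` and the nested-list image layout (height x width x channels) are assumptions chosen for illustration, not any library's API. In a typical ViT, each flattened patch would then be linearly projected and combined with a position embedding before entering the transformer; that part is omitted here.

```python
from typing import List

# Hypothetical image layout for this sketch: height x width x channels.
Image = List[List[List[float]]]


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into non-overlapping square patches.

    Each patch is flattened (row by row, channels innermost) into one
    vector; the returned list of vectors is the token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    tokens = []
    # Walk the patch grid left to right, top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(image[row][col])
            tokens.append(token)
    return tokens


def make_test_image(height: int, width: int, channels: int = 3) -> Image:
    """Build a toy image whose pixel value is its row-major index."""
    return [
        [[float(r * width + c)] * channels for c in range(width)]
        for r in range(height)
    ]


image = make_test_image(4, 4)
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

The sequence length grows with the number of patches, (height / patch_size) * (width / patch_size), so a smaller patch size gives the transformer more, finer-grained tokens at a higher compute cost.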