clip-vit-base-patch32 is a zero-shot image classification model from OpenAI built on the CLIP (Contrastive Language-Image Pre-training) framework. It pairs a base-sized Vision Transformer with 32x32 pixel patches (ViT-B/32) as the image encoder with a masked self-attention Transformer as the text encoder. The two encoders are trained jointly with a contrastive loss that aligns images and text in a shared embedding space. Because classification reduces to computing similarity between an image and a set of natural-language prompts, the model generalizes to new tasks without additional fine-tuning. Trained on a large corpus of image-text pairs collected from the internet, CLIP supports flexible and interpretable image classification. While well suited to research and robustness testing, OpenAI advises against commercial or surveillance use without domain-specific evaluation, citing fairness and bias concerns and performance that varies across class taxonomies.
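The zero-shot workflow can be sketched with the Hugging Face transformers API: encode an image together with candidate label prompts and read the image-text similarity scores as class probabilities. The image path and label strings below are placeholders chosen for illustration, not part of the model card.

```python
# Minimal zero-shot classification sketch; "example.jpg" and the labels are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Tokenize the prompts and preprocess the image in one call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the class set is just a list of prompts, swapping in a new taxonomy requires no retraining, only new label strings.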
Features
- Vision Transformer architecture (ViT-B/32)
- Contrastive learning on image-text pairs
- Enables zero-shot image classification
- Trained on large-scale internet datasets
- Text-image similarity via a shared embedding space (see the sketch after this list)
- Competitive zero-shot accuracy on benchmarks such as CIFAR10, ImageNet, and Food101
- FairFace-based demographic performance evaluations
- Available in PyTorch, TensorFlow, and JAX
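The shared embedding space noted above can also be used directly. The sketch below, assuming the same transformers API and placeholder inputs as before, extracts image and text embeddings separately and compares them with cosine similarity, which is the quantity the contrastive objective aligns.

```python
# Sketch of direct embedding comparison; "example.jpg" and the prompts are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
prompts = ["a plate of sushi", "a bowl of ramen", "a slice of pizza"]

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)  # (1, 512) for ViT-B/32
    text_emb = model.get_text_features(**text_inputs)     # (3, 512)

# Normalize so the dot product equals cosine similarity in the shared space
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = image_emb @ text_emb.T  # one similarity score per prompt
print(similarity)
```

Precomputing and caching text embeddings this way is a common pattern when the same label set is reused across many images.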