BLIP-Image-Captioning-Base is a pre-trained vision-language model from Salesforce that generates natural language descriptions of images. Built on the BLIP (Bootstrapping Language-Image Pre-training) framework, it uses a ViT-base image encoder and is fine-tuned on the COCO dataset. The model supports both conditional and unconditional image captioning and delivers strong results on standard benchmarks, including image captioning (measured by CIDEr) and image-text retrieval. BLIP introduces a bootstrapping strategy for noisy web-sourced image-caption data: a captioner generates synthetic captions and a filter removes noisy ones. Its unified architecture is designed for both vision-language understanding and generation, and it generalizes well even in zero-shot settings. The model can be deployed with Hugging Face Transformers in PyTorch or TensorFlow, with support for GPU acceleration and half-precision inference, as sketched in the example below.
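A minimal sketch of conditional and unconditional captioning with the Transformers `BlipProcessor` / `BlipForConditionalGeneration` classes; the local image path and the "a photography of" prompt are illustrative placeholders, not part of the model:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# any RGB image works; the path below is just an example
raw_image = Image.open("example.jpg").convert("RGB")

# conditional captioning: the model continues a text prompt
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional captioning: no prompt, the model describes the image freely
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```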
Features
- Generates captions for images (both conditional and unconditional)
- Based on a ViT-base image encoder and the BLIP architecture
- Pre-trained on bootstrapped web image-text data and fine-tuned on COCO for captioning
- Achieves strong performance on image-text retrieval, captioning, and VQA
- Supports zero-shot transfer to video-language tasks
- Usable with Hugging Face Transformers (PyTorch, TensorFlow)
- Compatible with GPU and float16 (half-precision) inference (see the sketch after this list)
- Licensed under BSD-3-Clause for flexible research use
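A minimal sketch of GPU half-precision inference, assuming a CUDA device is available; the image path is again an illustrative placeholder:

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
# load the weights in float16 and move the model to the GPU
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base", torch_dtype=torch.float16
).to("cuda")

raw_image = Image.open("example.jpg").convert("RGB")  # illustrative local path
# cast the pixel values to float16 and move them to the same device as the model
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```

Half precision roughly halves GPU memory use and speeds up inference on modern GPUs with little to no loss in caption quality; omit the `torch_dtype` argument and the `.to("cuda", ...)` calls to run in full precision on CPU.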