blip-image-captioning-large is a vision-language model developed by Salesforce that generates image captions using a large Vision Transformer (ViT-L) backbone. It is part of the BLIP framework, which unifies vision-language understanding and generation in a single model. The model is pretrained on web-scale image-text pairs using a bootstrapped captioning strategy (CapFilt), in which a captioner generates synthetic captions and a filter removes noisy ones, and is then fine-tuned on the COCO dataset for captioning. This bootstrapping improves robustness across diverse vision-language tasks, including image captioning, image-text retrieval, and visual question answering (VQA). BLIP-large achieves state-of-the-art results on image captioning (measured by CIDEr) and VQA accuracy. It supports both conditional and unconditional captioning and transfers zero-shot to video-language tasks. With roughly 470 million parameters, it offers a powerful, scalable solution for image-to-text generation.
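A minimal sketch of both captioning modes using the Hugging Face transformers API, assuming the checkpoint is published as Salesforce/blip-image-captioning-large; the image URL and the "a photography of" prompt are placeholders, not fixed parts of the model:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and captioning model from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Any RGB image works; this URL is only an illustrative placeholder.
img_url = "https://example.com/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Conditional captioning: the model completes a text prompt.
inputs = processor(raw_image, "a photography of", return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional captioning: no prompt, the model describes the image freely.
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```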
Features
- Large ViT backbone for improved vision-language modeling
- Fine-tuned on the COCO dataset for image captioning
- Supports conditional and unconditional captioning
- Bootstrapped training with synthetic caption filtering
- Achieves state-of-the-art image captioning (CIDEr) and VQA results
- Strong generalization to unseen video-language tasks
- Available in PyTorch with support for float16 and CPU/GPU inference (see the half-precision sketch after this list)
- Part of the unified BLIP framework for image-text generation and understanding
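The float16 and GPU support noted above can be exercised by loading the weights in half precision and moving the model and inputs to CUDA. A minimal sketch, assuming a CUDA-capable device and the same Salesforce/blip-image-captioning-large checkpoint; the image URL is again a placeholder:

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load weights in half precision and place the model on the GPU.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large", torch_dtype=torch.float16
).to("cuda")

# Placeholder image; substitute any RGB image.
img_url = "https://example.com/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Inputs must match the model's device and floating-point dtype.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```

For CPU-only inference, simply omit the torch_dtype argument and the .to("cuda", ...) calls.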