Qwen2.5-VL-3B-Instruct is a 3.75-billion-parameter multimodal model from Qwen, built for complex vision-language tasks over both images and videos. As part of the Qwen2.5 series, it supports image-text-to-text generation with capabilities such as chart reading, object localization, and structured data extraction. The model can act as a visual agent that interacts with digital interfaces, and it understands long-form videos through dynamic resolution and frame-rate sampling. Architecturally, it pairs a ViT vision encoder enhanced with SwiGLU and RMSNorm with updated mRoPE positional encoding for robust temporal and spatial understanding. The model accepts flexible image input (file path, URL, or base64) and can return structured responses such as bounding boxes or JSON, making it versatile for both commercial and research use. It performs strongly on a wide range of benchmarks, including DocVQA, InfoVQA, and AndroidWorld control tasks.
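A minimal inference sketch with Hugging Face Transformers and qwen-vl-utils is shown below; the image URL is a placeholder, and a recent transformers release plus `pip install qwen-vl-utils` are assumed:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the 3B instruct checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Images can be passed as a local file path, URL, or base64 string.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and strip the prompt tokens from the decoded output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```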
Features
- Handles multimodal input: text, image, video, charts, and layouts
- Supports structured output (e.g., JSON for invoices or tables); see the grounding sketch after this list
- Visual agent capabilities for UI interaction and digital tool control
- Long-video comprehension with temporal event pinpointing
- Dynamic image/video resolution and FPS support
- FlashAttention 2 support for efficient multimodal inference (loading sketch after this list)
- Supports visual localization via bounding boxes and coordinates
- Integrated with Hugging Face Transformers and qwen-vl-utils
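As a sketch of the structured-output and localization features above, the same pipeline can be prompted to return detections as JSON with bounding boxes; the prompt wording, the `photo.jpg` path, and the exact output schema here are illustrative assumptions:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Ask for grounding results as JSON; "photo.jpg" is a placeholder path.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "photo.jpg"},
            {"type": "text", "text": "Detect every person in the image and output "
                                     "a JSON list with keys 'label' and 'bbox_2d'."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# The model replies with text that should contain the requested JSON structure.
out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```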
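For the FlashAttention 2 item above, a minimal loading sketch; it assumes the flash-attn package is installed and a compatible GPU is available, with bfloat16 chosen here for memory efficiency:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Enable FlashAttention 2 for faster, lower-memory attention during inference.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```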