VisualGLM-6B is an open-source multimodal conversational language model developed by ZhipuAI that supports both images and text in Chinese and English. It builds on the ChatGLM-6B backbone, which contributes 6.2 billion language parameters, and adds a BLIP2-Qformer visual module to bridge vision and language, bringing the total to 7.8 billion parameters. Pretraining uses a large bilingual dataset of 30 million high-quality Chinese image-text pairs from the CogView dataset and 300 million English pairs, targeting image understanding, description, and visual question answering. Additional fine-tuning on long visual question answering data aligns the model's responses with human preferences. The repository provides an inference API, command-line and web demos, and parameter-efficient fine-tuning options such as LoRA, QLoRA, and P-tuning. It also supports quantization down to INT4, enabling local deployment on consumer GPUs with as little as 6.3 GB of GPU memory.
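
As a concrete illustration of the inference API, here is a minimal sketch. It assumes the weights are published on the Hugging Face Hub as `THUDM/visualglm-6b` and that the remote model code exposes a ChatGLM-style `chat()` helper; consult the repository's own examples for the exact interface.

```python
# Minimal inference sketch. Assumes the Hugging Face Hub id "THUDM/visualglm-6b"
# and a ChatGLM-style model.chat() helper; check the repository for the exact API.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

image_path = "examples/photo.jpg"  # hypothetical local image path

# First turn: ask for a description of the image.
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)

# Follow-up turn reusing the accumulated history for multi-turn dialogue.
response, history = model.chat(
    tokenizer, image_path, "Where might this photo have been taken?", history=history
)
print(response)
```
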
Features
- 7.8B-parameter multimodal conversational model (6.2B-parameter ChatGLM-6B language backbone plus a BLIP2-Qformer visual module)
- Supports Chinese and English image-based dialogue
- Pretrained on 330M bilingual image-text pairs (30M Chinese + 300M English) to align visual features with the language model's semantic space
- Fine-tuning support via LoRA, QLoRA, and P-tuning for domain-specific tasks
- Efficient INT4 quantization allows inference with as little as 6.3 GB of GPU memory (see the sketch after this list)
- Provides CLI demos, web demos, and REST API deployment options
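
For memory-constrained GPUs, the INT4 path can be enabled when the model is loaded. The sketch below assumes a ChatGLM-style `quantize()` method on the remotely loaded model class; the exact call order and supported bit widths should be verified against the repository.

```python
# INT4 quantization sketch for low-memory GPUs (~6.3 GB reported by the project).
# Assumes the remote code exposes a ChatGLM-style quantize() method; verify locally.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 weights; quantize(8) trades memory for quality if 4-bit is too lossy
    .half()
    .cuda()
    .eval()
)

# Single-turn query against a hypothetical local image path.
response, _ = model.chat(tokenizer, "examples/photo.jpg", "What is in this picture?", history=[])
print(response)
```
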