VMZ: Model Zoo for Video Modeling
Phi-3.5 for Mac: Locally-run Vision and Language Models
Chinese and English multimodal conversational language model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Qwen3-Omni is a natively end-to-end, omni-modal LLM
General-purpose image editing model that delivers high-fidelity
Multimodal embedding and reranking models built on Qwen3-VL
Inference script for Oasis 500M
ICLR 2024 Spotlight: curation/training code, metadata, distribution
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Towards Real-World Vision-Language Understanding
OCR expert VLM powered by Hunyuan's native multimodal architecture
Large language model and vision-language model based on linear attention
Chat & pretrained large vision language model
Official code for Style Aligned Image Generation via Shared Attention
A latent text-to-image diffusion model
PyTorch implementation of MAE
GLIDE: a diffusion-based text-conditional image synthesis model
Vision-language-action model for robot control via images and text