Diffusion Transformer with Fine-Grained Chinese Understanding
Fast stable diffusion on CPU and AI PC
High-Resolution Image Synthesis with Latent Diffusion Models
Chat & pretrained large vision language model
Designed for text embedding and ranking tasks
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
Code for running inference and finetuning with SAM 3 model
Chinese and English multimodal conversational language model
RGBD video generation model conditioned on camera input
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Fast-stable-diffusion + DreamBooth
Controllable & emotion-expressive zero-shot TTS
Unified Multimodal Understanding and Generation Models
Qwen2.5-VL, a multimodal large language model series
Recovering the Visual Space from Any Views
Implementation of "MobileCLIP" (CVPR 2024)
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Official Python inference and LoRA trainer package
Multimodal embedding and reranking models built on Qwen3-VL
Phi-3.5 for Mac: Locally-run Vision and Language Models
Towards Real-World Vision-Language Understanding
Accurate × Fast × Comprehensive
A Systematic Framework for Interactive World Modeling