Towards Real-World Vision-Language Understanding
High-Resolution Image Synthesis with Latent Diffusion Models
High-Resolution 3D Assets Generation with Large Scale Diffusion Models
Diffusion Transformer with Fine-Grained Chinese Understanding
OCR expert VLM powered by Hunyuan's native multimodal architecture
Qwen3-Coder is the code version of Qwen3
CogView4, CogView3-Plus, and CogView3 (ECCV 2024)
Large Multimodal Models for Video Understanding and Editing
Qwen2.5-VL is the multimodal large language model series
Implementation of "MobileCLIP" (CVPR 2024)
Chinese and English multimodal conversational language model
The official repo of Qwen chat & pretrained large language model
Repo of Qwen2-Audio chat & pretrained large audio language model
Unified Multimodal Understanding and Generation Models
The official PyTorch implementation of Google's Gemma models
Multimodal Diffusion with Representation Alignment
Official code for Style Aligned Image Generation via Shared Attention
Memory-efficient and performant finetuning of Mistral's models
Pushing the Limits of Mathematical Reasoning in Open Language Models
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
LLM-based audio editing model trained with reinforcement learning
Open-weight, large-scale hybrid-attention reasoning model
Phi-3.5 for Mac: Locally-run Vision and Language Models
Renderer for the harmony response format to be used with gpt-oss
Multimodal large language model designed for audio understanding