Qwen2.5-VL is a multimodal large language model series
Capable of understanding text, audio, vision, and video
Official inference repo for FLUX.1 models
Multimodal-Driven Architecture for Customized Video Generation
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Qwen3 is the large language model series developed by the Qwen team
Large language model & vision language model based on Linear Attention
Chinese and English multimodal conversational language model
Designed for text embedding and ranking tasks
High-Resolution 3D Assets Generation with Large Scale Diffusion Models
HY-Motion model for 3D character animation generation
LLM-based reinforcement learning audio editing model
Open-source multi-speaker long-form text-to-speech model
tiktoken is a fast BPE tokeniser for use with OpenAI's models
Diffusion Transformer with Fine-Grained Chinese Understanding
The official repo of Qwen chat & pretrained large language model
Visual Causal Flow
Generate Any 3D Scene in Seconds
CogView4, CogView3-Plus, and CogView3 (ECCV 2024)
OCR expert VLM powered by Hunyuan's native multimodal architecture
Unified Multimodal Understanding and Generation Models
Implementation of "MobileCLIP" (CVPR 2024)
Towards Real-World Vision-Language Understanding
Memory-efficient and performant finetuning of Mistral's models
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming