Controllable & emotion-expressive zero-shot TTS
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Renderer for the harmony response format used with gpt-oss
Long-form streaming TTS system for multi-speaker dialogue generation
Open-source industrial-grade ASR models
Qwen3-ASR is an open-source series of ASR models
OpenTinker is an RL-as-a-Service infrastructure for foundation models
VMZ: Model Zoo for Video Modeling
Large Multimodal Models for Video Understanding and Editing
General-purpose image editing model that delivers high-fidelity results
Ling-V2 is a MoE LLM open-sourced by InclusionAI
Easy Docker setup for Stable Diffusion with user-friendly UI
4M: Massively Multimodal Masked Modeling
A Production-ready Reinforcement Learning AI Agent Library
Hackable and optimized Transformers building blocks
Towards Real-World Vision-Language Understanding
Reproduction of Poetiq's record-breaking submission to the ARC-AGI-1 benchmark
Unified Multimodal Understanding and Generation Models
Tooling for the Common Objects In 3D dataset
VGGSfM: Visual Geometry Grounded Deep Structure From Motion
A SOTA open-source image editing model
Multi-modal large language model designed for audio understanding
State-of-the-art (SoTA) text-to-video pre-trained model
OCR expert VLM powered by Hunyuan's native multimodal architecture
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning