Audio foundation model excelling in audio understanding
Open-source framework for intelligent speech interaction
Repo of Qwen2-Audio chat & pretrained large audio language model
LLM-based reinforcement learning audio editing model
Multi-modal large language model designed for audio understanding
Official Python inference and LoRA trainer package
A Family of Open-Source Music Foundation Models
Miso TTS is an 8-billion-parameter, highly emotive text-to-speech model
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Open-source multi-speaker long-form text-to-speech model
Chat & pretrained large audio language model proposed by Alibaba Cloud
Multimodal Diffusion with Representation Alignment
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Capable of understanding text, audio, vision, video
MOSS‑TTS Family: open‑source speech and sound generation models
Multimodal-Driven Architecture for Customized Video Generation
A Systematic Framework for Interactive World Modeling
Open Source Speech Language Model
VMZ: Model Zoo for Video Modeling
Controllable & emotion-expressive zero-shot TTS
Official repository for LTX-Video
Industrial-level controllable zero-shot text-to-speech system
A 0.1B Omni model trained from scratch
Qwen3-ASR is an open-source series of ASR models
High-resolution models for human tasks