Audio foundation model excelling in audio understanding
Chat & pretrained large audio language model proposed by Alibaba Cloud
Repo of Qwen2-Audio chat & pretrained large audio language model
A Family of Open-Source Music Foundation Models
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Multimodal Diffusion with Representation Alignment
Capable of understanding text, audio, vision, and video
Open-source multi-speaker long-form text-to-speech model
A Systematic Framework for Interactive World Modeling
Multimodal-Driven Architecture for Customized Video Generation
Industrial-level controllable zero-shot text-to-speech system
Python inference and LoRA trainer package for the LTX-2 audio–video
Qwen3-TTS is an open-source series of TTS models
Controllable & emotion-expressive zero-shot TTS
High-resolution models for human tasks
Official repository for LTX-Video
Foundational Models for State-of-the-Art Speech and Text Translation
Large Multimodal Models for Video Understanding and Editing
GLM-4-Voice | End-to-End Chinese-English Conversational Model
State-of-the-art TTS model under 25MB
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
A Conversational Speech Generation Model
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
Code for the paper "Hybrid Spectrogram and Waveform Source Separation"