Audio foundation model excelling in audio understanding
Official Python inference and LoRA trainer package
A Family of Open Sourced Music Foundation Models
Multimodal Diffusion with Representation Alignment
Open-source multi-speaker long-form text-to-speech model
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
A Systematic Framework for Interactive World Modeling
Multimodal-Driven Architecture for Customized Video Generation
Controllable & emotion-expressive zero-shot TTS
Python inference and LoRA trainer package for the LTX-2 audio–video
Open Source Speech Language Model
Industrial-level controllable zero-shot text-to-speech system
Qwen3-ASR is an open-source series of ASR models
Official repository for LTX-Video
Qwen3-TTS is an open-source series of TTS models
High-resolution models for human tasks
Foundational Models for State-of-the-Art Speech and Text Translation
Large Multimodal Models for Video Understanding and Editing
A Conversational Speech Generation Model
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)
React app for inspecting, building and debugging with the Realtime API
Dia-1.6B generates lifelike English dialogue and vocal expressions
CTC-based forced aligner for audio-text in 158 languages
Portuguese ASR model fine-tuned on XLSR-53 for 16 kHz audio input