Qwen3-TTS is an open-source series of TTS models
Industrial-level controllable zero-shot text-to-speech system
Miso TTS is an 8-billion-parameter, highly emotive text-to-speech model
Open Source Speech Language Model
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Super-expressive prompting model based on ltx2.3
Long-form streaming TTS system for multi-speaker dialogue generation
Open-source multi-speaker long-form text-to-speech model
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Capable of understanding text, audio, images, and video
State-of-the-art TTS model under 25MB
Controllable & emotion-expressive zero-shot TTS
Open-source framework for intelligent speech interaction
MOSS‑TTS Family: open‑source speech and sound generation models
LLM-based audio editing model trained with reinforcement learning
Qwen3-ASR is an open-source series of ASR models
Audio foundation model excelling in audio understanding
Repository for the Qwen2-Audio chat and pretrained large audio language models
A 0.1B Omni model trained from scratch
Open-source industrial-grade ASR models
FAIR Sequence Modeling Toolkit 2
A GPT-4o-level MLLM for vision, speech, and multimodal live streaming
Audio Language Models are Few-Shot Learners
Multi-modal large language model designed for audio understanding
Foundational Models for State-of-the-Art Speech and Text Translation