Industrial-level controllable zero-shot text-to-speech system
Qwen3-Omni is a natively end-to-end, omni-modal LLM capable of understanding text, audio, vision, and video
Open-source multi-speaker long-form text-to-speech model
GLM-4-Voice | End-to-End Chinese-English Conversational Model
State-of-the-art TTS model under 25MB
Controllable & emotion-expressive zero-shot TTS
Qwen3-TTS is an open-source series of TTS models
Repo of the Qwen2-Audio chat & pretrained large audio language models
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
FAIR Sequence Modeling Toolkit 2
Foundational Models for State-of-the-Art Speech and Text Translation
Chat & pretrained large audio language model proposed by Alibaba Cloud
A Conversational Speech Generation Model
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech)
Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)
Russian ASR model fine-tuned on Common Voice and CSS10 datasets
CTC-based forced aligner for audio-text in 158 languages
Portuguese ASR model fine-tuned on XLSR-53 for 16kHz audio input
Dia-1.6B generates lifelike English dialogue and vocal expressions