VibeVoice-1.5B is Microsoft’s frontier open-source text-to-speech (TTS) model for generating expressive, long-form, multi-speaker conversational audio such as podcasts. Unlike traditional TTS systems, it is built to scale to long sessions while keeping speakers consistent and turn-taking natural, handling up to 90 minutes of continuous speech with as many as four distinct speakers. A key innovation is its pair of continuous speech tokenizers, one acoustic and one semantic, which operate at an ultra-low frame rate of 7.5 Hz. The low frame rate keeps long sequences short enough to process efficiently while preserving audio fidelity. The model pairs a Qwen2.5-based large language model, which captures conversational context, with a diffusion head that produces realistic acoustic detail. Training used curriculum learning with progressively longer sequences, up to 65K tokens, so that VibeVoice can handle very long dialogues. To mitigate misuse, all generated audio carries an audible disclaimer and an imperceptible watermark.
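The relationship between the 7.5 Hz frame rate, the 90-minute limit, and the 65K-token training length can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes each speech frame occupies roughly one position in the sequence, reads "65K" as 65,536, and ignores the tokens used by the input text and voice prompts. The function and constant names are illustrative and are not part of any VibeVoice API.

```python
# Rough sequence-length budget for long-form generation.
# Assumptions (not from the model's code): one sequence position per
# speech frame, and "65K tokens" interpreted as 2**16 = 65,536.

FRAME_RATE_HZ = 7.5        # tokenizer frame rate stated in the description
MAX_MINUTES = 90           # maximum audio length stated in the description
CONTEXT_TOKENS = 65_536    # assumed reading of "65K tokens"


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


frames_90_min = speech_frames(MAX_MINUTES)
headroom = CONTEXT_TOKENS - frames_90_min

print(f"Frames for {MAX_MINUTES} minutes: {frames_90_min}")
print(f"Positions left for text and prompts: {headroom}")
```

Under these assumptions, 90 minutes of audio needs about 40,500 frames, which fits inside a 65K-token sequence with room left over for the script itself.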
Features
- Open-source TTS model for expressive, long-form conversational speech
- Generates up to 90 minutes of audio with up to 4 distinct speakers
- Continuous acoustic and semantic tokenizers at 7.5 Hz for fidelity and efficiency
- Integrates Qwen2.5-1.5B LLM with a diffusion head for context and realism
- Curriculum-trained on sequences up to 65K tokens for long dialogues
- Embedded audible disclaimer and imperceptible watermark in all outputs
- Licensed under MIT for open research and responsible development
- Focused on English and Chinese; not suitable for other languages or non-speech audio