ⓍTTS-v2 (XTTS-v2) by Coqui is a powerful multilingual text-to-speech model capable of cloning voices from a short 6-second audio sample. It supports 17 languages and enables high-quality voice generation with emotion, style transfer, and cross-language synthesis.

The model introduces major improvements over ⓍTTS-v1, including better prosody, stability, and support for Hungarian and Korean. ⓍTTS-v2 allows interpolation between multiple voice references and generates speech at a 24 kHz sampling rate. It's suited to both inference and fine-tuning, with a Python API and command-line tools available.

The model powers Coqui Studio and the Coqui API, and can be run locally using Python or through Hugging Face Spaces. Licensed under the Coqui Public Model License, it balances open access with responsible use of generative voice technology.
Features
- Voice cloning from a 6-second audio clip
- Supports 17 languages including Arabic, Chinese, Hindi, and Japanese
- Emotion and style transfer capabilities
- Cross-lingual voice generation with speaker consistency
- 24 kHz audio output for high sound quality
- Improved prosody and speaker conditioning over v1
- Fine-tuning and interpolation between multiple voice samples
- Usable from Python, the command line, and Hugging Face Spaces
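The local Python workflow mentioned above can be sketched as follows. This is a minimal example based on the Coqui TTS package (`pip install TTS`); the `clone_voice` helper and its argument defaults are illustrative, not part of the library. Note that constructing the model triggers a large one-time download.

```python
# Hugging Face / Coqui model identifier for XTTS-v2
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"

def clone_voice(text, speaker_wav, language="en", out_path="output.wav"):
    """Synthesize `text` in the voice of `speaker_wav` (a ~6 s reference clip).

    Illustrative wrapper around the Coqui TTS API; the import is deferred
    because loading the package pulls in the full model stack.
    """
    from TTS.api import TTS  # requires `pip install TTS`; downloads the model on first use

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,  # one reference clip; a list of clips can be passed for interpolation
        language=language,        # one of the 17 supported language codes, e.g. "en", "hu", "ko"
        file_path=out_path,
    )
    return out_path
```

The same synthesis is available from the CLI shipped with the package, along the lines of `tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "..." --speaker_wav ref.wav --language_idx en --out_path out.wav`.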