XTTS-v2 is an open-source multilingual text-to-speech and voice cloning model developed by Coqui. It performs zero-shot voice cloning from a reference clip as short as six seconds, generating speech that closely matches a target speaker without additional training. The model supports 17 languages and cross-language cloning: a voice recorded in one language can be used to synthesize speech in another while preserving speaker identity. Compared with the original XTTS, XTTS-v2 improves speaker conditioning, accepts multiple reference clips, and offers better prosody, audio quality, and inference stability. It generates audio at a 24 kHz sampling rate, and because the output follows the reference clip, the emotion and speaking style of that clip carry over to the synthesized speech. XTTS-v2 can run entirely offline, supports both inference and fine-tuning, and is used for AI assistants, content creation, dubbing, accessibility tools, and multilingual voice applications.
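
As a rough illustration of cross-language cloning, the sketch below wraps the Coqui `TTS` Python package (assumed to be installed) in a small helper. The helper name `clone_voice`, the file paths, and the `XTTS_V2_LANGUAGES` tuple are illustrative and not part of the library; the language codes follow the list on the XTTS-v2 model card. The model weights are downloaded on first use.

```python
# Language codes as listed on the XTTS-v2 model card (17 in total).
XTTS_V2_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)

# Model identifier used by the Coqui TTS package for XTTS-v2.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, speaker_wav, language, out_path="output.wav"):
    """Synthesize `text` in `language` using the voice in `speaker_wav`.

    Hypothetical convenience wrapper: it validates the language code,
    then delegates to the Coqui TTS API. Returns the output file path.
    """
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {XTTS_V2_LANGUAGES}"
        )

    # Imported lazily so the validation above works even when the
    # (large) TTS package is not installed.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)  # runs on CPU unless moved to another device
    tts.tts_to_file(
        text=text,
        file_path=out_path,
        speaker_wav=speaker_wav,  # reference clip of the target speaker
        language=language,        # language of the generated speech
    )
    return out_path
```

For example, `clone_voice("Bonjour tout le monde.", "english_speaker.wav", "fr")` would be expected to produce French speech in the voice of the English reference clip, where `english_speaker.wav` is a placeholder path.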
Features
- Zero-shot voice cloning from as little as 6 seconds of audio
- Supports 17 languages, including English and Spanish
- Cross-language voice cloning while preserving speaker identity
- Emotion and speaking style transfer through voice cloning
- 24 kHz audio generation for high-quality speech output
- Supports multiple speaker reference clips and interpolation between speakers
- Can run locally without requiring cloud APIs
- Supports both inference and fine-tuning workflows
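
Since cloning quality depends on the reference audio, it can help to screen clips before passing them to the model. The following is a minimal sketch using only the Python standard library. The helper names and the six-second threshold are assumptions: the threshold is taken from the figure quoted above and treated as a guideline, not a limit enforced by the model. The sketch handles uncompressed PCM WAV files only.

```python
import wave

# Guideline from the feature list above; an assumption, not a model constraint.
MIN_REFERENCE_SECONDS = 6.0


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def usable_references(paths, min_seconds=MIN_REFERENCE_SECONDS):
    """Keep only the clips that are at least `min_seconds` long,
    preserving their original order."""
    return [p for p in paths if clip_seconds(p) >= min_seconds]
```

The filtered list can then be passed as the `speaker_wav` argument of `tts_to_file` in the Coqui `TTS` Python API, which according to Coqui's XTTS documentation accepts a list of paths when several reference clips are available.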