XTTS-v2
Multilingual voice cloning model that works from 6-second voice samples
XTTS-v2 is an open-source multilingual text-to-speech and voice cloning model developed by Coqui. It enables zero-shot voice cloning using as little as six seconds of reference audio, allowing users to generate speech that closely matches a target speaker without additional training. The model supports 17 languages and can perform cross-language voice cloning, meaning a voice recorded in one language can be used to synthesize speech in another while preserving speaker identity. XTTS-v2 improves on the original XTTS architecture with better speaker conditioning, support for multiple reference clips, improved prosody, enhanced audio quality, and greater inference stability. ...
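The zero-shot and cross-language workflow described above can be sketched in Python. This is a minimal sketch, not an official recipe: it assumes the Coqui `TTS` Python package is installed, that the model is published under the identifier `tts_models/multilingual/multi-dataset/xtts_v2`, and that `tts_to_file` accepts the `speaker_wav`, `language`, and `file_path` keywords shown. None of these details come from the description above, and neither does the list of language codes, which is an assumption based on the model card. The helper names `build_request` and `clone_voice` and all file names are hypothetical.

```python
# Language codes assumed to be the 17 supported by XTTS-v2 (not stated in the text above).
SUPPORTED_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)

# Assumed model identifier in the Coqui TTS model registry.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_request(text, reference_clips, language, out_path="output.wav"):
    """Validate the inputs and return keyword arguments for synthesis.

    reference_clips is a list of paths to short recordings of the target
    speaker; passing more than one clip is optional.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    clips = [str(clip) for clip in reference_clips]
    if not clips:
        raise ValueError("at least one reference clip is required")
    return {
        "text": text,
        "speaker_wav": clips,
        "language": language,
        "file_path": str(out_path),
    }


def clone_voice(text, reference_clips, language, out_path="output.wav"):
    """Synthesize `text` in the reference speaker's voice (hypothetical helper)."""
    # Imported lazily so the validation helper above works without the package.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)  # downloads the model weights on first use
    tts.tts_to_file(**build_request(text, reference_clips, language, out_path))
    return out_path
```

Calling `clone_voice("Hola, ¿qué tal?", ["speaker_en.wav"], "es")` would illustrate the cross-language case: the reference clip is recorded in one language while the output is synthesized in another. No fine-tuning step appears anywhere in the sketch, which is the point of zero-shot cloning.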