XTTS-v2
Multilingual voice-cloning model that works from a 6-second voice sample
...The model supports 17 languages and can perform cross-language voice cloning: a voice recorded in one language can be used to synthesize speech in another while preserving speaker identity. XTTS-v2 improves on the original XTTS architecture with better speaker conditioning, support for multiple reference clips, improved prosody, higher audio quality, and more stable inference. It generates speech at a 24 kHz sampling rate, and the emotion and speaking style of the reference audio carry over into the cloned voice. The model can run entirely offline, supports both inference and fine-tuning, and is widely used in AI assistants, content creation, dubbing, accessibility tools, and multilingual voice applications.
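The cloning workflow described above can be sketched in Python. This is a minimal illustration, not an official recipe: it assumes the Coqui `TTS` package and its `tts_models/multilingual/multi-dataset/xtts_v2` model identifier, and it assumes the 17 language codes listed on the model card. The helpers `clip_duration_seconds`, `build_request`, and `synthesize` are hypothetical names introduced here, and treating 6 seconds as a hard minimum is a choice of this sketch, taken from the tagline.

```python
import wave

# Language codes as listed on the XTTS-v2 model card (an assumption of this
# sketch; verify against the version you have installed).
XTTS_V2_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)

# Treated as a minimum here; the source only mentions "6-second voice samples".
MIN_REFERENCE_SECONDS = 6.0


def clip_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def build_request(text, reference_wavs, language):
    """Validate inputs and return keyword arguments for synthesis.

    Hypothetical helper: it only checks the language code and the length
    of each reference clip, then packages the arguments.
    """
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not reference_wavs:
        raise ValueError("at least one reference clip is required")
    for path in reference_wavs:
        duration = clip_duration_seconds(path)
        if duration < MIN_REFERENCE_SECONDS:
            raise ValueError(
                f"{path} is {duration:.1f}s; expected at least "
                f"{MIN_REFERENCE_SECONDS:.0f}s of reference audio"
            )
    # The target language may differ from the language spoken in the
    # reference clips (cross-language cloning).
    return {
        "text": text,
        "speaker_wav": list(reference_wavs),
        "language": language,
    }


def synthesize(text, reference_wavs, language, out_path="output.wav"):
    """Clone the reference voice and write the synthesized speech to out_path."""
    # Imported lazily so the helpers above work without the TTS package.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        file_path=out_path,
        **build_request(text, reference_wavs, language),
    )
    return out_path
```

A call such as `synthesize("Bonjour tout le monde.", ["speaker_en.wav"], "fr")` would then speak French text in the voice captured by an English reference recording; `speaker_en.wav` is a placeholder file name. Passing several paths in the list corresponds to the multiple-reference-clip support mentioned above.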