VALL-E X
Open source implementation of Microsoft's VALL-E X zero-shot TTS model
VALL-E-X is an open-source implementation of Microsoft’s VALL-E X zero-shot text-to-speech model, focused on multilingual, cross-lingual voice cloning. It is capable of synthesizing speech in English, Chinese, and Japanese from text while mimicking the voice characteristics of a speaker given only a short 3–10 second prompt. The model attempts to match not just timbre, but also tone, pitch, emotion, and prosody of the reference audio, resulting in highly personalized output. VALL-E-X supports zero-shot cross-lingual synthesis, meaning a monolingual speaker’s voice can be used to speak other languages without additional training. It also preserves aspects of the acoustic environment, such as background noise or reverb, making the generated audio feel more like it came from the same setting as the prompt. ...