One minute of voice data is enough to train a good TTS model
State-of-the-art TTS model under 25MB
SOTA Open Source TTS
Industrial-level controllable zero-shot text-to-speech system
Towards Human-Sounding Speech
Miso TTS is an 8-billion-parameter, highly emotive text-to-speech model
Capable of understanding text, audio, vision, and video
Towards Human-Level Text-to-Speech through Style Diffusion
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech)