WhisperSpeech is an open-source text-to-speech system created by “inverting” OpenAI’s Whisper: instead of only transcribing speech, Whisper’s strengths as a semantic audio model are reused to generate it. The project aims to be for speech what Stable Diffusion is for images: powerful, hackable, and safe for commercial use, with code under Apache-2.0/MIT and models trained only on properly licensed data.

The architecture follows a token-based, multi-stage pipeline inspired by AudioLM and SPEAR-TTS: Whisper produces semantic tokens, EnCodec compresses the waveform into acoustic tokens, and Vocos reconstructs high-fidelity audio from those tokens.

The repository includes notebooks and scripts for inference, long-form synthesis, and finetuning, as well as pre-trained models and converted datasets hosted on Hugging Face. Performance optimizations such as torch.compile, KV-caching, and architectural tweaks let the main model reach up to 12× real-time speed on a consumer RTX 4090.
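The three-stage pipeline can be sketched as a chain of token transformations. This is a conceptual illustration only, not the project’s actual API: every function below is a hypothetical stand-in for the real Whisper, EnCodec, and Vocos model stages.

```python
# Conceptual sketch of the WhisperSpeech-style token pipeline.
# All function names and token schemes here are illustrative stand-ins;
# the real system runs neural models (Whisper, EnCodec, Vocos) at each stage.

def text_to_semantic_tokens(text: str) -> list[int]:
    """Stand-in for the text-to-semantic stage (Whisper-derived tokens)."""
    # Hypothetical encoding: one token id per character, for illustration only.
    return [ord(c) % 512 for c in text]

def semantic_to_acoustic_tokens(semantic: list[int]) -> list[int]:
    """Stand-in for the semantic-to-acoustic stage (EnCodec-style tokens)."""
    return [(t * 3) % 1024 for t in semantic]

def acoustic_tokens_to_audio(acoustic: list[int]) -> list[float]:
    """Stand-in for the vocoder stage (Vocos): tokens -> waveform samples."""
    return [t / 1024.0 for t in acoustic]

def synthesize(text: str) -> list[float]:
    """Chain the three stages, mirroring the pipeline described above."""
    semantic = text_to_semantic_tokens(text)
    acoustic = semantic_to_acoustic_tokens(semantic)
    return acoustic_tokens_to_audio(acoustic)
```

The point of the staged design is that each hop operates on a compact token stream, so the stages can be trained and swapped independently.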
Features
- Text-to-speech system built by inverting Whisper into a semantic token generator
- Three-stage pipeline using Whisper (semantic), EnCodec (acoustic tokens), and Vocos (vocoder)
- Open-source code under Apache-2.0/MIT with models trained on properly licensed datasets
- High-performance inference using torch.compile and KV-caching, reaching over 10× real-time speed on consumer GPUs
- Support for voice cloning, multilingual experiments, and code-switching within a single utterance
- Notebooks and scripts for long-form generation, finetuning, and community-driven benchmarking
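KV-caching, one of the inference optimizations listed above, avoids recomputing attention keys and values for tokens already generated. The toy cache below is a minimal sketch of the idea, not WhisperSpeech’s implementation; real caches store per-layer tensors rather than Python lists.

```python
# Minimal sketch of a key/value cache for autoregressive decoding.
# Hypothetical, illustrative only: each step appends just the new token's
# k/v projections, and attention reads the accumulated cache instead of
# reprojecting every past token from scratch.

class KVCache:
    def __init__(self) -> None:
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def append(self, k: list[float], v: list[float]) -> None:
        """Store the key/value projection for the newest token only."""
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)

def decode_step(cache: KVCache, new_k: list[float], new_v: list[float]) -> int:
    """One decoding step: O(1) new projection work per token,
    while attention sees the full cached context."""
    cache.append(new_k, new_v)
    return len(cache)  # context length visible to attention at this step
```

Without a cache, step *t* would recompute projections for all *t* previous tokens, making decoding quadratic; with it, per-step projection cost stays constant.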