WhisperSpeech is an open-source text-to-speech system built by “inverting” OpenAI’s Whisper: it reuses Whisper’s strengths as a semantic audio model to generate speech rather than only transcribe it. The project aims to be for speech what Stable Diffusion is for images: powerful, hackable, and safe for commercial use, with code under Apache-2.0/MIT and models trained only on properly licensed data. Its architecture is a token-based, multi-stage pipeline inspired by AudioLM and SPEAR-TTS: Whisper produces the semantic tokens, EnCodec compresses the waveform into acoustic tokens, and Vocos reconstructs high-fidelity audio from those acoustic tokens. The repository includes notebooks and scripts for inference, long-form synthesis, and finetuning, along with pre-trained models and converted datasets hosted on Hugging Face. Performance optimizations such as torch.compile, KV-caching, and architectural tweaks let the main model generate speech up to 12× faster than real time on a consumer RTX 4090.
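
To make the "12× real-time" figure concrete: a real-time factor of this kind is usually the duration of the generated audio divided by the wall-clock time spent generating it. The helper below is a minimal sketch of that arithmetic; the function name and the example numbers are illustrative assumptions, not part of WhisperSpeech or a measured benchmark.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute.

    Hypothetical helper for illustration; values above 1.0 mean
    faster-than-real-time synthesis.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Illustrative numbers only: 60 s of speech generated in 5 s of compute.
speedup = real_time_factor(audio_seconds=60.0, wall_seconds=5.0)
print(f"{speedup:.1f}x real time")
```

Under this definition, a 12× speedup means one minute of speech takes about five seconds to generate.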

Features

  • Text-to-speech system built by inverting Whisper, which serves as the source of semantic tokens
  • Three-stage pipeline using Whisper (semantic tokens), EnCodec (acoustic tokens), and Vocos (vocoder)
  • Open-source code under Apache-2.0/MIT with models trained on properly licensed datasets
  • High-performance inference, with optimizations such as torch.compile and KV-caching giving faster-than-real-time generation on GPUs (up to 12× on an RTX 4090)
  • Support for voice cloning, multilingual experiments, and code-switching within a single utterance
  • Notebooks and scripts for long-form generation, finetuning, and community-driven benchmarking
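
The token-based pipeline in the feature list can be pictured as a chain of stages, where each stage consumes the previous stage's tokens. The sketch below only shows that data flow under stated assumptions: `TokenPipeline` and the three `toy_*` functions are hypothetical stand-ins invented for illustration, not the WhisperSpeech API, and the toy stages do no real modelling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TokenPipeline:
    """Hypothetical three-stage chain: text -> semantic -> acoustic -> audio."""
    text_to_semantic: Callable[[str], List[int]]
    semantic_to_acoustic: Callable[[List[int]], List[int]]
    vocoder: Callable[[List[int]], List[float]]

    def synthesize(self, text: str) -> List[float]:
        semantic = self.text_to_semantic(text)          # Whisper-derived token space
        acoustic = self.semantic_to_acoustic(semantic)  # EnCodec-style token space
        return self.vocoder(acoustic)                   # Vocos-style reconstruction


# Toy stand-ins so the sketch runs; real stages would be neural models.
def toy_text_to_semantic(text: str) -> List[int]:
    return [ord(ch) % 512 for ch in text]


def toy_semantic_to_acoustic(tokens: List[int]) -> List[int]:
    # Emit two acoustic tokens per semantic token to mimic a finer time grid.
    return [t for tok in tokens for t in (tok, (tok * 7) % 1024)]


def toy_vocoder(tokens: List[int]) -> List[float]:
    # Map token ids to values in [0, 1) as a placeholder for waveform samples.
    return [t / 1024.0 for t in tokens]


pipe = TokenPipeline(toy_text_to_semantic, toy_semantic_to_acoustic, toy_vocoder)
samples = pipe.synthesize("hi")
print(len(samples))
```

Keeping the stages separate like this is what lets each one be swapped or finetuned on its own, which is the usual motivation for multi-stage token pipelines.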

Categories

Text to Speech

License

MIT License

Additional Project Details

Programming Language

Python

Related Categories

Python Text to Speech Software

Registered

2025-11-28