MegaTTS3 is an open-source text-to-speech (TTS) and voice-cloning system from ByteDance. It targets high-quality, expressive speech synthesis, including zero-shot voice cloning of previously unseen speakers. Its backbone is a lightweight diffusion transformer with roughly 0.45 B parameters, which keeps inference efficient while still producing high-fidelity audio. Given a reference audio sample and its corresponding latent representation, MegaTTS3 generates speech in that speaker's style and voice timbre, which is useful for personalized TTS, voice-overs, dubbing, and multi-speaker applications. The system supports both Chinese and English, including code-switching between them, and offers controls for accent strength, the trade-off between intelligibility and voice similarity, and other speech parameters for fine-tuning the output.
Features
- Zero-shot voice cloning — generate speech in the voice of an arbitrary speaker from a short reference sample
- Lightweight diffusion-transformer backbone (~0.45 B parameters), enabling efficient inference even on modest hardware
- Bilingual (Chinese and English) support, including code-switching — useful for multilingual applications
- Fine-grained control over speech parameters (accent strength, the trade-off between voice similarity and intelligibility, pronunciation and duration adjustments)
- Local-first operation (via Python or Docker) — no mandatory cloud dependency, increasing privacy and control
- Open-source under Apache-2.0 — weights and code accessible, enabling research, customization, or integration into custom pipelines