MARS5-TTS is CAMB.AI’s open-source English speech model for high-quality text-to-speech and voice emulation. It uses a two-stage architecture that pairs an autoregressive (AR) model with a non-autoregressive (NAR) model, aiming for both expressiveness and speed. The model is built to handle prosodically challenging content, such as sports commentary, anime dialogue, and other high-energy or highly varied speech, with realistic rhythm and intonation. To control speaker identity, MARS5 takes a short reference audio clip, typically between 2 and 12 seconds long, and derives the voice characteristics from it. It supports two main inference modes: shallow clone, which is faster and needs only the reference audio, and deep clone, which additionally uses the transcript of the reference audio to increase similarity and naturalness at the cost of more computation.
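The choice between the two modes can be sketched as a small pre-flight check. This is a minimal illustration and not part of MARS5 itself: `plan_clone`, `ClonePlan`, and the constants are hypothetical names, and it assumes the reference clip has already been resampled to 24 kHz. It treats the 2 to 12 second range as a guideline and flags clips outside it instead of rejecting them.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the reference clip is resampled to 24 kHz before this check.
SAMPLE_RATE_HZ = 24_000
MIN_REF_SECONDS = 2.0   # typical lower bound for the reference clip
MAX_REF_SECONDS = 12.0  # typical upper bound for the reference clip


@dataclass(frozen=True)
class ClonePlan:
    """Hypothetical container describing how a request would be run."""
    deep_clone: bool
    ref_seconds: float
    within_typical_range: bool


def plan_clone(num_samples: int,
               ref_transcript: Optional[str] = None,
               sample_rate: int = SAMPLE_RATE_HZ) -> ClonePlan:
    """Pick deep clone when a transcript is available, else shallow clone."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")

    seconds = num_samples / sample_rate
    in_range = MIN_REF_SECONDS <= seconds <= MAX_REF_SECONDS

    # Deep clone needs the transcript of the reference audio;
    # an empty or whitespace-only transcript falls back to shallow clone.
    has_transcript = bool(ref_transcript and ref_transcript.strip())
    return ClonePlan(deep_clone=has_transcript,
                     ref_seconds=seconds,
                     within_typical_range=in_range)
```

A caller might run this before inference and warn the user when `within_typical_range` is false, or when a transcript is missing and only the faster shallow clone is possible.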
Features
- Two-stage AR–NAR TTS architecture optimized for difficult prosody and expressive speech
- Voice cloning from short 2–12 second reference clips with optional deep clone mode using transcripts
- Torch Hub integration for loading the model without cloning the repo, plus Hugging Face support
- Configurable inference options (temperature, top-k, repetition penalties, and more) for fine control over output
- 24 kHz audio generation suitable for production-quality speech applications
- Designed for use cases like sports commentary, anime, movies, and other prosodically complex audio
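As a rough sketch of how the Torch Hub loading and the configurable inference options might fit together, the example below follows the usage pattern from the project's README as best recalled. The hub entry point (`Camb-ai/mars5-tts`, `mars5_english`), the config fields, and the `tts` call signature are assumptions that should be checked against the repository. `inference_settings` and `synthesize` are hypothetical helper names, and `torch` and `librosa` are imported lazily so the settings helper can be used without them installed.

```python
from typing import Any, Dict, Optional


def inference_settings(deep_clone: bool,
                       temperature: float = 0.7,
                       top_k: int = 100) -> Dict[str, Any]:
    """Collect a few sampling options; field names are assumed, not verified."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {"deep_clone": deep_clone,
            "temperature": temperature,
            "top_k": top_k}


def synthesize(text: str,
               ref_wav_path: str,
               ref_transcript: Optional[str] = None):
    """Load MARS5 through Torch Hub and synthesize `text` in the reference voice."""
    # Lazy imports: only needed when synthesis is actually requested.
    import librosa
    import torch

    # Assumed entry point; downloads the model on first use.
    mars5, config_class = torch.hub.load(
        "Camb-ai/mars5-tts", "mars5_english", trust_repo=True
    )

    # Load the short reference clip at the model's sample rate, as mono audio.
    wav, _ = librosa.load(ref_wav_path, sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)

    # Deep clone only when a transcript of the reference audio is supplied.
    use_deep_clone = bool(ref_transcript and ref_transcript.strip())
    cfg = config_class(**inference_settings(deep_clone=use_deep_clone))

    # Assumption: an empty transcript is acceptable for shallow clone.
    _, output_audio = mars5.tts(text, wav, ref_transcript or "", cfg=cfg)
    return output_audio
```

Lower temperatures and smaller top-k values would generally be expected to give more conservative output, while higher values trade stability for variety. Sensible ranges depend on the model and are best taken from the project's own documentation.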