Orpheus TTS is a state-of-the-art open-source text-to-speech system built on a Llama-3B backbone. It treats speech synthesis as a large language model task rather than a traditional TTS pipeline, and it is designed to produce human-like speech with natural intonation, emotion, and rhythm, targeting quality comparable to or better than many closed-source systems. The project ships pretrained and finetuned English models, a family of multilingual models released as a research preview, and data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation; its streaming examples show how to generate audio chunks in real time. The maintainers also provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8- or FP16-optimized inference with roughly 200 ms streaming latency.
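To make the streaming idea concrete, here is a minimal sketch of consuming audio chunks as they arrive and writing them to a WAV file while recording the time to the first chunk. It does not call the Orpheus package: `fake_chunk_stream` is a hypothetical stand-in for whatever generator the real inference code returns, and the 24 kHz, 16-bit mono format is an assumption that should be checked against the model you use.

```python
import io
import time
import wave
from typing import Iterable, Iterator, Optional, Tuple

# Assumed output format (not confirmed by the description above).
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM
CHANNELS = 1


def fake_chunk_stream(n_chunks: int = 5, chunk_ms: int = 100) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS generator: yields silent PCM chunks."""
    frames_per_chunk = SAMPLE_RATE_HZ * chunk_ms // 1000
    for _ in range(n_chunks):
        yield b"\x00" * (frames_per_chunk * SAMPLE_WIDTH_BYTES * CHANNELS)


def write_stream_to_wav(chunks: Iterable[bytes], target) -> Tuple[float, Optional[float]]:
    """Write chunks to a WAV file as they arrive.

    `target` may be a file path or a seekable binary file object.
    Returns (seconds_of_audio_written, seconds_until_first_chunk).
    """
    start = time.monotonic()
    first_chunk_latency = None
    total_frames = 0
    with wave.open(target, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH_BYTES)
        wf.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            if first_chunk_latency is None:
                # Time to first audio is the figure a streaming latency claim refers to.
                first_chunk_latency = time.monotonic() - start
            wf.writeframes(chunk)
            total_frames += len(chunk) // (SAMPLE_WIDTH_BYTES * CHANNELS)
    return total_frames / SAMPLE_RATE_HZ, first_chunk_latency


if __name__ == "__main__":
    buffer = io.BytesIO()
    seconds, latency = write_stream_to_wav(fake_chunk_stream(), buffer)
    print(f"wrote {seconds:.2f} s of audio, first chunk after {latency:.4f} s")
```

In a real setup the stand-in generator would be replaced by the model's streaming call, and each chunk could be sent to an audio device or a network socket instead of a file.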
Features
- Llama-3B-based TTS architecture that treats speech as a generative LLM task
- Pretrained and finetuned English models, plus a multilingual research-preview family with a training guide
- Zero-shot voice cloning and tag-based control over emotion, intonation, and speaking style
- Streaming inference with ~200 ms latency using vLLM and real-time audio chunk generation
- One-click production deployment via Baseten with FP8- or FP16-optimized inference options
- Support for watermarking generated audio, and for CPU-only inference through llama.cpp integrations
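As an illustration of tag-based control, the sketch below builds a text prompt with inline emotive tags and rejects tags it does not recognize. Everything specific here is an assumption made for the example: the `voice: text` layout, the voice name, and the tag names in `KNOWN_TAGS` are placeholders, and `build_prompt` is a hypothetical helper, not part of the Orpheus package. Consult the project's prompting-format documentation for the actual conventions.

```python
import re

# Assumption: placeholder tag set for illustration; the supported tags
# depend on the specific model and should be taken from its documentation.
KNOWN_TAGS = {"laugh", "chuckle", "sigh", "gasp"}

TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def build_prompt(voice: str, text: str) -> str:
    """Hypothetical helper: prefix the text with a voice name and validate inline tags."""
    unknown = set(TAG_PATTERN.findall(text)) - KNOWN_TAGS
    if unknown:
        raise ValueError(f"unsupported tags: {sorted(unknown)}")
    # Assumed layout: "<voice>: <text>"
    return f"{voice}: {text.strip()}"


if __name__ == "__main__":
    print(build_prompt("tara", "I didn't expect that <laugh> but here we are."))
```

Validating tags before generation is a design choice for the sketch: an unsupported tag would otherwise reach the model as ordinary text and might be read aloud.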