Miso TTS is an 8-billion-parameter text-to-speech model developed by Miso Labs for generating expressive, natural-sounding conversational speech. Built on an RVQ (residual vector quantization) Transformer architecture inspired by Sesame CSM, it pairs a Llama-based backbone with an autoregressive audio decoder that generates Mimi audio codes, from which the output audio is produced. The model supports both standard speech synthesis and voice-conditioned generation, where an optional audio prompt and its transcript are used for voice cloning. It can also condition on conversation history to produce more contextually aware and realistic dialogue. Designed for local deployment, it applies watermarking to generated audio by default to help promote responsible use. With its focus on emotive speech generation, Miso TTS is aimed at AI voice applications, virtual assistants, and conversational AI experiences.
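To make the "RVQ" part of the architecture concrete, here is a minimal sketch of residual vector quantization in plain Python. It is an illustration of the general technique only: the codebooks are tiny hand-made examples, the function names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical, and none of this is taken from the Miso TTS or Mimi code, where the codebooks are learned.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec in stages: each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values): a coarse stage followed by a finer stage.
codebooks = [
    [[0, 0], [1, 0], [0, 1]],
    [[0, 0], [0.25, 0], [0, 0.25]],
]
codes = rvq_encode([0.9, 0.3], codebooks)
print(codes)                          # [1, 2]
print(rvq_decode(codes, codebooks))   # [1.0, 0.25]
```

Each audio frame is represented by one small integer per stage, which is why a Transformer can predict audio as a sequence of discrete codes rather than raw samples.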
Features
- High-Quality Speech Synthesis – Generates natural, expressive, and emotionally rich speech from text input.
- Voice Cloning Support – Uses optional audio prompts and transcripts to create speech that matches a specific voice.
- Advanced RVQ Transformer Architecture – Combines an 8B-parameter backbone with a dedicated audio decoder for realistic audio generation.
- Context-Aware Dialogue Generation – Supports conditioning on previous conversation history for more coherent and conversational outputs.
- Built-In Audio Watermarking – Applies watermarking to generated audio by default to encourage responsible deployment and content attribution.
- Local & GPU-Accelerated Deployment – Runs locally with Hugging Face-hosted model weights and optimized CUDA-based inference for high-performance generation.
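
The context-aware and voice-cloning features above both amount to passing earlier turns (text, plus optional reference audio) alongside the new text. The sketch below shows one way such a context could be assembled. Everything in it is an assumption for illustration: the `Turn` type, the `build_context` and `format_prompt` helpers, the `[speaker]text` prompt format, and the turn limit are hypothetical and are not the Miso TTS API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: int                      # hypothetical integer speaker id
    text: str                         # transcript of the turn
    audio_path: Optional[str] = None  # hypothetical reference audio for voice conditioning


def build_context(history: List[Turn], max_turns: int = 4) -> List[Turn]:
    """Keep only the most recent turns so the prompt stays within a fixed budget."""
    if max_turns <= 0:
        return []
    return history[-max_turns:]


def format_prompt(context: List[Turn], speaker: int, text: str) -> List[str]:
    """Render the context turns and the new utterance as speaker-tagged lines."""
    lines = [f"[{turn.speaker}]{turn.text}" for turn in context]
    lines.append(f"[{speaker}]{text}")
    return lines


history = [
    Turn(0, "Hi", audio_path="speaker0_ref.wav"),  # placeholder file name
    Turn(1, "Hello"),
    Turn(0, "How are you?"),
]
prompt = format_prompt(build_context(history, max_turns=2), speaker=1, text="Fine.")
print(prompt)  # ['[1]Hello', '[0]How are you?', '[1]Fine.']
```

Capping the number of turns is a design choice made here for simplicity; a real integration would follow whatever context limits and input format the model's own inference code defines.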