CSM-1B (Conversational Speech Model) is a text-to-speech model from Sesame that generates natural-sounding speech from text and audio inputs. Its LLaMA-based backbone produces residual vector quantization (RVQ) audio codes, which a lightweight Mimi audio decoder converts into waveforms. The model handles both single-sentence generation and full conversational modeling conditioned on prior text and audio turns. It is not fine-tuned to mimic any specific voice, but it can produce a wide range of synthetic speaker identities. CSM-1B runs natively in Hugging Face Transformers (v4.52.1+) and supports batched inference, CUDA graph compilation, and fine-tuning with the standard Transformers Trainer. Training is primarily in English; multilingual ability is limited and stems mainly from incidental non-English data in the training set. The model is released under the Apache-2.0 license with strict ethical use guidelines prohibiting impersonation, misinformation, and other forms of misuse.
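As a rough illustration of the native Transformers integration, the sketch below generates a single sentence. It assumes the `CsmForConditionalGeneration` class, `AutoProcessor`, the `[speaker_id]` text prefix convention, and the `output_audio` / `save_audio` helpers exposed by the CSM support in Transformers v4.52.1+; the output filename is a placeholder.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (text tokenizer + audio feature extractor) and the model.
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects speaker id 0; the rest is the text to synthesize.
text = "[0]Hello from CSM, a conversational speech model."
inputs = processor(text, add_special_tokens=True).to(device)

# With output_audio=True, generate() returns decoded audio rather than raw RVQ codes.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "single_sentence.wav")  # placeholder filename
```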
Features
- Text-to-speech generation using RVQ audio code output
- LLaMA-based model with Mimi audio decoder
- Supports full conversational input with contextual audio (see the sketch after this list)
- Batched inference and CUDA graph support for efficiency
- Fine-tuning available via Transformers’ Trainer API
- Native support in Hugging Face Transformers (v4.52.1+)
- Open-ended voice generation without predefined speakers
- Ethical use policy to prevent impersonation and misuse
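The conversational path can be sketched roughly as follows: prior turns carry both text and audio, and the final turn carries only the text to be voiced. The chat-template content format, the `output_audio` flag, and `processor.save_audio` are assumptions about the Transformers CSM integration, and the wav files and utterances are placeholders, not part of the model card.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Context turns: each carries its text plus the matching audio (placeholder 24 kHz wav files).
conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "So what did you think of the show?"},
            {"type": "audio", "path": "speaker0_turn.wav"},
        ],
    },
    {
        "role": "1",
        "content": [
            {"type": "text", "text": "Honestly, it exceeded my expectations."},
            {"type": "audio", "path": "speaker1_turn.wav"},
        ],
    },
    # Final turn: text only -- this is what the model voices,
    # conditioned on the preceding text and audio context.
    {"role": "0", "content": [{"type": "text", "text": "Same here, I'd watch it again."}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "contextual_turn.wav")  # placeholder filename
```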