CSM-1B (Conversational Speech Model) is a text-to-speech model from Sesame that generates natural-sounding speech from text and audio inputs. Its LLaMA-based backbone produces residual vector quantization (RVQ) audio codes, which a lightweight Mimi audio decoder converts into waveforms. The model handles both single-sentence generation and full conversational modeling conditioned on prior text and audio turns. It is not fine-tuned to mimic any specific voice, but it can produce a wide range of synthetic speaker identities. CSM-1B runs natively in Hugging Face Transformers (v4.52.1+) and supports batched inference, CUDA graph compilation, and fine-tuning with the standard Transformers Trainer. Training is primarily in English; multilingual ability is limited and stems mainly from incidental non-English data in the training set. The model is released under the Apache-2.0 license with strict ethical use guidelines prohibiting impersonation, misinformation, and other forms of misuse.
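As a rough illustration of the native Transformers integration, the sketch below generates a single sentence. It assumes the `CsmForConditionalGeneration` class, `AutoProcessor`, the `[speaker_id]` text prefix convention, and the `output_audio` / `save_audio` helpers exposed by the CSM support in Transformers v4.52.1+; the output filename is a placeholder.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (text tokenizer + audio feature extractor) and the model.
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects speaker id 0; the rest is the text to synthesize.
text = "[0]Hello from CSM, a conversational speech model."
inputs = processor(text, add_special_tokens=True).to(device)

# With output_audio=True, generate() returns decoded audio rather than raw RVQ codes.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "single_sentence.wav")  # placeholder filename
```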
Features
- Text-to-speech generation using RVQ audio code output
- LLaMA-based model with Mimi audio decoder
- Supports full conversational input with contextual audio (see the sketch after this list)
- Batched inference and CUDA graph support for efficiency
- Fine-tuning available via Transformers’ Trainer API
- Native support in Hugging Face Transformers (v4.52.1+)
- Open-ended voice generation without predefined speakers
- Ethical use policy to prevent impersonation and misuse
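The conversational path can be sketched roughly as follows: prior turns carry both text and audio, and the final turn carries only the text to be voiced. The chat-template content format, the `output_audio` flag, and `processor.save_audio` are assumptions about the Transformers CSM integration, and the wav files and utterances are placeholders, not part of the model card.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Context turns: each carries its text plus the matching audio (placeholder 24 kHz wav files).
conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "So what did you think of the show?"},
            {"type": "audio", "path": "speaker0_turn.wav"},
        ],
    },
    {
        "role": "1",
        "content": [
            {"type": "text", "text": "Honestly, it exceeded my expectations."},
            {"type": "audio", "path": "speaker1_turn.wav"},
        ],
    },
    # Final turn: text only -- this is what the model voices,
    # conditioned on the preceding text and audio context.
    {"role": "0", "content": [{"type": "text", "text": "Same here, I'd watch it again."}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "contextual_turn.wav")  # placeholder filename
```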