VoxCPM
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for highly realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Rather than converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone that combines hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers. This design helps decouple semantic from acoustic information while preserving fine-grained prosody, which leads to more stable and expressive generation than many discrete-token systems. Trained on a 1.8-million-hour bilingual corpus, VoxCPM can infer an appropriate speaking style from context and adjust intonation, rhythm, and emotional tone to match. ...
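To make the FSQ component mentioned above more concrete, here is a minimal NumPy sketch of finite scalar quantization in its generic form: each channel of a latent vector is squashed into a bounded range and rounded to a small, fixed number of levels. This is not VoxCPM's code. The function name `fsq_quantize`, the level counts, and the restriction to odd level counts are assumptions made for illustration, and the description above does not say how VoxCPM configures or places FSQ inside the model.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Generic FSQ sketch (hypothetical helper, not VoxCPM's implementation).

    z:      array of shape (..., d) holding continuous latents.
    levels: d odd integers, the number of quantization levels per channel.
    Returns an array of the same shape with values on a grid in [-1, 1].
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        # Even level counts need an extra half-step offset; omitted here.
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) / 2          # e.g. 5 levels -> integers in [-2, 2]
    bounded = np.tanh(z) * half      # squash each channel into (-half, half)
    return np.round(bounded) / half  # snap to the grid, rescale to [-1, 1]

# Two latent vectors with three channels each (made-up values).
z = np.array([[0.1, -2.0, 0.7],
              [3.0,  0.0, -0.4]])
q = fsq_quantize(z, levels=[5, 5, 3])
```

In a trained model the rounding step would normally be paired with a straight-through gradient estimator so that gradients can pass through it; that part is left out because NumPy has no autograd.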