Dia is a neural text-to-speech model designed to generate ultra-realistic dialogue in a single pass. Rather than focusing on isolated sentences or flat narration, it is optimized for conversational audio, complete with natural turn-taking, prosody, and pacing. The model can be conditioned on a reference audio sample, letting you control the emotion, tone, and other stylistic aspects of the generated speech. It can also produce nonverbal vocalizations such as laughter, coughing, and throat clearing, which are crucial for making synthetic conversations feel human. Dia ships with pretrained checkpoints and inference code, with weights hosted on Hugging Face, so researchers and developers can quickly try it or integrate it into their pipelines. The base model currently targets English and has roughly 1.6 billion parameters, striking a strong balance between realism and computational cost, and the ecosystem also includes Dia2, a streaming variant for real-time use.
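As a quick taste, here is a minimal generation sketch modeled on the repo's Python examples. The `Dia.from_pretrained` and `generate` calls, the `nari-labs/Dia-1.6B` checkpoint name, and the `[S1]`/`[S2]` speaker tags follow the published examples, but treat exact names and defaults as assumptions and check the repo before relying on them.

```python
# Minimal sketch of single-pass dialogue generation with Dia.
# Assumes the released inference package is installed and exposes
# Dia.from_pretrained / generate as in the repo's examples.
import soundfile as sf

from dia.model import Dia

# Download the 1.6B English checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# One script, two speakers: [S1]/[S2] mark turns, and parenthesized
# tags like (laughs) request nonverbal vocalizations inline.
script = (
    "[S1] Dia generates a whole conversation in one pass. "
    "[S2] Turn-taking, pacing, even laughter? (laughs) "
    "[S1] Exactly, it all comes from the same model."
)

audio = model.generate(script)          # waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```

Because the whole exchange is synthesized in one pass, turn boundaries and pacing come from the model itself rather than from stitching per-speaker clips together.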
Features
- One-pass dialogue TTS model focused on highly realistic conversational speech
- Audio conditioning to control emotion, tone, and speaking style of the generated voice (see the sketch after this list)
- Support for nonverbal events such as laughter, coughing, and throat-clearing within generated audio
- Pretrained checkpoints and inference scripts with weights hosted on Hugging Face
- English-focused 1.6B-parameter architecture balancing quality with inference cost
- Companion streaming variant (Dia2) for real-time conversational scenarios and low-latency synthesis
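The audio-conditioning feature above is worth a concrete illustration. The sketch below follows the pattern in the repo's voice-cloning example: prepend the transcript of a reference clip to the new script and pass the clip itself as an audio prompt. The `audio_prompt` parameter name and the file paths are assumptions drawn from the published examples; verify them against the current inference code.

```python
# Sketch of conditioning generation on a reference clip to steer
# voice, emotion, and tone. The audio_prompt argument and paths are
# illustrative; check the repo's voice-cloning example for exact names.
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference audio, followed by the new lines to speak.
# Dia carries the reference speaker's style over to the matching tags.
reference_transcript = "[S1] This is the voice I want the model to imitate. "
new_lines = "[S1] And this is brand-new dialogue, spoken in that same style."

audio = model.generate(
    reference_transcript + new_lines,
    audio_prompt="reference.wav",  # hypothetical path to the reference clip
)
sf.write("cloned_dialogue.wav", audio, 44100)
```

Only the newly written lines appear in the output; the reference transcript serves to align the audio prompt with the text so the model can infer the target style.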