DiffSinger is an open-source PyTorch implementation of a diffusion-based acoustic model for singing-voice synthesis (SVS), with a companion text-to-speech (TTS) variant (DiffSpeech). The core idea is to treat mel-spectrogram generation as a diffusion process: starting from Gaussian noise, the model iteratively denoises while being conditioned on the music score (lyrics, pitch, and note timing). This sidesteps typical failure modes of earlier SVS models, such as the over-smoothed output of simple regression decoders and the unstable training of GANs, and yields more realistic, expressive, and natural-sounding singing. The method also introduces a shallow diffusion mechanism: rather than running the full reverse trajectory from pure noise, a simple mel-spectrogram decoder first produces a coarse prediction, that prediction is forward-diffused to an intermediate ("shallow") step, and denoising starts from there. This reuses the prior knowledge captured by the simple decoder and speeds up inference.
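The shallow diffusion idea can be illustrated with a toy DDPM-style reverse process. This is a minimal sketch, not DiffSinger's actual implementation: the beta schedule, the step count `T`, the shallow step `k`, and the dummy noise predictor are all placeholder assumptions, and NumPy stands in for PyTorch so the snippet runs standalone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear noise schedule (hypothetical values; the real config differs).
T = 100
betas = np.linspace(1e-4, 0.06, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t):
    """Forward-diffuse a clean mel x0 to noise level t (closed form)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def denoise_step(x_t, t, eps_hat):
    """One DDPM reverse step, given the model's predicted noise eps_hat."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean += np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

def shallow_diffusion_infer(coarse_mel, k, eps_model):
    """Shallow diffusion: diffuse the simple decoder's coarse mel to step k,
    then run only k+1 reverse steps instead of the full T."""
    x = q_sample(coarse_mel, k)
    for t in range(k, -1, -1):
        x = denoise_step(x, t, eps_model(x, t))
    return x

# Dummy stand-ins for the trained denoiser and the decoder output (assumptions).
dummy_eps = lambda x, t: np.zeros_like(x)
coarse = np.zeros((80, 10))  # 80 mel bins x 10 frames
mel = shallow_diffusion_infer(coarse, k=30, eps_model=dummy_eps)
print(mel.shape)  # (80, 10)
```

With `k=30` the loop runs 31 reverse steps instead of 100, which is where the speedup comes from; the starting point is informative (the diffused coarse mel) rather than pure noise.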
Features
- Diffusion-based singing voice synthesis (SVS) conditioned on musical score
- Support for multiple input modalities: lyrics + pitch (F0), lyrics + MIDI
- Shallow diffusion mechanism for faster inference without compromising quality
- Built-in vocoder integration (HiFiGAN / NSF-HiFiGAN) to convert mel-spectrogram to waveform
- Also supports conventional text-to-speech (TTS), not just singing
- Pretrained models and example workflows to simplify getting started
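On the vocoder bullet: the acoustic model stops at mel-spectrograms, and a separate neural vocoder (HiFiGAN / NSF-HiFiGAN) turns them into audio. The sketch below only illustrates the shape contract of that stage, not real synthesis; `HOP_LENGTH` and `N_MELS` are typical values assumed for illustration, and the "vocoder" is a trivial frame upsampler.

```python
import numpy as np

HOP_LENGTH = 256  # audio samples per spectrogram frame (assumed typical value)
N_MELS = 80       # mel bins (assumed typical value)

def toy_vocoder(mel):
    """Stand-in for a neural vocoder: maps a mel-spectrogram of shape
    (N_MELS, n_frames) to a waveform of shape (n_frames * HOP_LENGTH,).
    Here we just average over mel bins and repeat each frame."""
    n_mels, n_frames = mel.shape
    assert n_mels == N_MELS
    frame_energy = mel.mean(axis=0)
    return np.repeat(frame_energy, HOP_LENGTH)

mel = np.zeros((N_MELS, 100))  # 100 frames of "silence"
wav = toy_vocoder(mel)
print(wav.shape)  # (25600,)
```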