DC-TTS is a TensorFlow implementation of the DC-TTS architecture, a fully convolutional text-to-speech system designed to be efficiently trainable while producing natural speech. It follows the paper “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention,” with the author adapting and extending the design to make it practical for training and synthesis on real data. The model is split into two networks: Text2Mel, which maps text to coarse mel-spectrograms, and SSRN (spectrogram super-resolution network), which converts those low-resolution mel-spectrograms into high-resolution magnitude spectrograms suitable for waveform synthesis. Training scripts, data loaders, and hyperparameter configurations are provided to reproduce results on several datasets, including LJ Speech for English, a Korean single-speaker dataset, and audiobook data from Nick Offerman and Kate Winslet.
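As a rough sketch of how the two stages fit together, the shapes below use the paper's default hyperparameters (80 mel bins, a time-reduction factor of 4, and a 2048-point FFT); this repo's actual configuration may differ:

```python
import numpy as np

# Illustrative shapes only; n_mels, the reduction factor r, and n_fft
# are the paper's defaults and may not match this repo's configuration.
n_mels, r, n_fft = 80, 4, 2048

def text2mel_output_shape(coarse_frames):
    # Text2Mel predicts one coarse mel frame per r audio frames.
    return (n_mels, coarse_frames)

def ssrn_output_shape(coarse_frames):
    # SSRN upsamples time by r and maps mel bins to linear-frequency bins.
    return (1 + n_fft // 2, coarse_frames * r)

mel = np.zeros(text2mel_output_shape(100))   # coarse mel: (80, 100)
mag = np.zeros(ssrn_output_shape(100))       # full magnitude: (1025, 400)
```

The coarse time axis is what makes Text2Mel cheap to train; SSRN restores full temporal and frequency resolution before a conventional spectrogram-inversion step produces audio.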
Features
- TensorFlow implementation of the DC-TTS architecture with convolution-only networks for text-to-speech
- Two-stage pipeline with Text2Mel and SSRN networks for mel-spectrogram generation and super-resolution
- Ready-made training scripts, data loaders, and hyperparameters for multiple English and Korean speech datasets
- Guided attention mechanism that encourages monotonic alignments and stabilizes training
- Support for normalization, dropout, and learning-rate decay to improve robustness beyond the original paper's training recipe
- Pretrained model for LJ Speech plus synthesis utilities to generate audio samples directly from text
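The guided attention feature above can be sketched as follows. This is the penalty matrix from the DC-TTS paper, written in NumPy as an illustration rather than the repo's exact TensorFlow code; the sharpness value g=0.2 is the paper's choice and may differ here:

```python
import numpy as np

def guided_attention_mask(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)).

    Near zero along the diagonal n/N ~ t/T and close to 1 far from it,
    so attention off the diagonal is penalized, pushing the alignment
    between text positions (n) and spectrogram frames (t) to be
    roughly monotonic.
    """
    n = np.arange(N).reshape(-1, 1) / N
    t = np.arange(T).reshape(1, -1) / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))

def guided_attention_loss(A, g=0.2):
    # Mean elementwise penalty over the attention matrix A (shape N x T);
    # added to the reconstruction loss during Text2Mel training.
    W = guided_attention_mask(*A.shape, g=g)
    return float(np.mean(W * A))
```

Because the penalty is a fixed mask rather than a learned constraint, it adds essentially no cost per step while greatly stabilizing early training, when attention would otherwise wander.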