VITS is a foundational research implementation of the paper "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (Kim et al., ICML 2021), a well-known neural TTS architecture. Unlike traditional two-stage systems that train an acoustic model and a vocoder separately, VITS trains a single end-to-end model that maps text directly to a waveform, combining a conditional variational autoencoder with normalizing flows and adversarial training. Because generation is parallel rather than autoregressive, inference is fast, while speech quality rivals or surpasses that of many two-stage systems.

The repository provides training and inference pipelines for common datasets such as LJ Speech (single-speaker) and VCTK (multi-speaker), including filelists, configs, and preprocessing scripts. It also includes monotonic alignment search (MAS) code and grapheme-to-phoneme (g2p) preprocessing, both crucial for aligning text and speech in an end-to-end setup.
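To make the alignment component concrete, here is a minimal NumPy sketch of the dynamic program behind monotonic alignment search: given a (tokens × frames) log-likelihood matrix, it finds the monotonic path that assigns every frame to exactly one token while maximizing total likelihood. This is a toy illustration under the assumption that there are at least as many frames as tokens; the repository's own implementation is a compiled extension optimized for speed, and all names below are illustrative.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Viterbi-style DP over a (num_tokens, num_frames) log-likelihood
    matrix. Each frame is assigned to one token; the token index may
    only stay the same or advance by one between consecutive frames."""
    n_tokens, n_frames = log_p.shape
    Q = np.full((n_tokens, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        # Token i is unreachable before frame i, so cap the loop at j.
        for i in range(min(j + 1, n_tokens)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_p[i, j] + max(stay, advance)
    # Backtrack from the bottom-right corner to recover the path.
    path = np.zeros((n_tokens, n_frames), dtype=np.int64)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[i, j] = 1
        # Advance to the previous token when forced (i == j) or better.
        if j > 0 and i > 0 and (i == j or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return path
```

During training, this hard alignment replaces an external duration labeler: summing each row of the returned path yields per-token durations for the duration predictor.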
## Features
- End-to-end TTS model combining conditional VAE, normalizing flows, and adversarial training
- Parallel waveform generation with high naturalness compared to classic two-stage pipelines
- Ready-made training recipes for the LJ Speech (single-speaker) and VCTK (multi-speaker) datasets
- Monotonic alignment search implementation and phoneme preprocessing scripts
- PyTorch-based code suitable for research, modification, and experimental extensions
- Widely adopted baseline architecture for many derivative and improved TTS systems
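As a small illustration of the normalizing-flow ingredient listed above, the sketch below implements one affine coupling step in NumPy: half the channels pass through unchanged and condition an affine transform of the other half, so the step is exactly invertible and its log-determinant is just the sum of the log-scales. The "conditioning network" here is a trivial function of the first half; this is a toy stand-in for the learned networks in the actual model, and all names are hypothetical.

```python
import numpy as np

def coupling_forward(x, scale_w, shift_w):
    """One affine coupling step. Splits the vector in half, transforms
    the second half conditioned on the first. The Jacobian is
    triangular, so log|det J| is the sum of the log-scales."""
    xa, xb = np.split(x, 2)
    log_s = np.tanh(scale_w * xa)  # toy conditioner; a real flow uses a network
    t = shift_w * xa
    yb = xb * np.exp(log_s) + t
    return np.concatenate([xa, yb]), log_s.sum()

def coupling_inverse(y, scale_w, shift_w):
    """Exact inverse: recompute the conditioner from the untouched
    half, then undo the affine transform."""
    ya, yb = np.split(y, 2)
    log_s = np.tanh(scale_w * ya)
    t = shift_w * ya
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb])
```

Stacking many such steps (with the halves swapped between steps) gives an expressive yet tractable density, which is what lets the VAE prior in this family of models be flexible while keeping the likelihood term computable.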