VITS download | SourceForge.net

VITS is the research implementation of “VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech,” a well-known neural TTS architecture. Unlike traditional two-stage systems that train an acoustic model and a vocoder separately, VITS trains a single end-to-end model that maps text directly to a waveform, using a conditional variational autoencoder combined with normalizing flows and adversarial training. This architecture generates audio in parallel, which makes inference fast, while reaching speech quality that rivals or surpasses many two-stage systems. The repository provides training and inference pipelines for two common datasets, LJ Speech (single-speaker) and VCTK (multi-speaker), including filelists, configs, and preprocessing scripts. It also includes monotonic alignment search code and grapheme-to-phoneme (g2p) preprocessing, both of which are needed to align text and speech in an end-to-end setup without external alignment labels.
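
To make the alignment step concrete, here is a minimal pure-Python sketch of a monotonic alignment search. It is not the repository's code: the function names `monotonic_alignment_search` and `path_to_durations` and the input layout (`log_probs[i][j]` is the log-likelihood of audio frame `j` under text token `i`) are assumptions made for illustration. The idea is a Viterbi-style dynamic program that finds the best path in which every frame maps to one token, tokens appear in order, and no token is skipped.

```python
NEG_INF = float("-inf")


def monotonic_alignment_search(log_probs):
    """Return, for each frame, the index of the token it is aligned to.

    log_probs[i][j] is an assumed layout: token i (rows) by frame j (columns).
    """
    n_tokens = len(log_probs)
    n_frames = len(log_probs[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best cumulative score of a monotonic path ending at (token i, frame j).
    q = [[NEG_INF] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_probs[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                               # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF  # moved on from previous token
            best = max(stay, advance)
            if best > NEG_INF:
                q[i][j] = best + log_probs[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


def path_to_durations(path, n_tokens):
    """Count how many frames were assigned to each token."""
    durations = [0] * n_tokens
    for token_index in path:
        durations[token_index] += 1
    return durations


example = [
    [0.0, 0.0, -5.0, -5.0],  # token 0 fits the first two frames
    [-5.0, -5.0, 0.0, 0.0],  # token 1 fits the last two frames
]
print(monotonic_alignment_search(example))  # [0, 0, 1, 1]
```

The per-token frame counts returned by `path_to_durations` are the kind of target a duration predictor can be trained on, which is why an alignment search of this sort matters when no external aligner is used.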

Features

End-to-end TTS model combining conditional VAE, normalizing flows, and adversarial training
Parallel waveform generation with naturalness comparable to or better than classic two-stage pipelines
Ready-made training recipes for the LJ Speech (single-speaker) and VCTK (multi-speaker) datasets
Monotonic alignment search implementation and phoneme preprocessing scripts
PyTorch-based code suitable for research, modification, and experimental extensions
Widely adopted baseline architecture for many derivative and improved TTS systems
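
As a rough illustration of why generation can run in parallel, the sketch below expands per-token values to frame level using integer durations. Once every frame knows which token it belongs to, all frames can be decoded at once instead of one step at a time. The function name `expand_by_duration` and the toy inputs are hypothetical and are not taken from the repository.

```python
def expand_by_duration(token_values, durations):
    """Repeat each per-token value for as many frames as its duration.

    token_values: one entry per text token (for example, a prior statistic).
    durations: non-negative integer frame counts, one per token.
    """
    if len(token_values) != len(durations):
        raise ValueError("token_values and durations must have the same length")
    frames = []
    for value, duration in zip(token_values, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([value] * duration)
    return frames


# Two tokens lasting 2 and 3 frames give a 5-frame sequence.
frame_values = expand_by_duration(["h", "i"], [2, 3])
print(frame_values)  # ['h', 'h', 'i', 'i', 'i']
```

In a real model the per-token values would be tensors and the expansion would typically be done with a batched matrix product against the alignment path, but the bookkeeping is the same.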

License

MIT License

Additional Project Details

Programming Language

Python

Related Categories

Python Text to Speech Software, Python Text-to-Speech (TTS) Models

Registered

2025-11-28
