VibeVoice ComfyUI is a wrapper that integrates Microsoft’s VibeVoice text-to-speech models into ComfyUI workflows. It exposes VibeVoice as a set of custom nodes, so you can build single-speaker and multi-speaker voice generation pipelines visually and combine TTS with other audio or generative components. The integration supports high-quality single-speaker synthesis as well as scripted multi-speaker conversations, with optional voice cloning from an audio sample for each speaker. Generation parameters such as the attention backend, number of diffusion steps, sampling temperature, guidance (CFG) scale, and quantization level are exposed, so users can trade off audio quality, VRAM usage, and speed. The project also adds LoRA support, making it possible to fine-tune and load custom LoRA adapters that modify voice identity or style while leaving the base VibeVoice model intact.
Features
- Single-speaker and multi-speaker VibeVoice TTS nodes with optional voice cloning
- LoRA adapter support for fine-tuning voice characteristics and styles within ComfyUI
- Configurable generation controls including attention type, diffusion steps, CFG scale, and sampling options
- Quantization options (4-bit, 8-bit, full precision) to balance VRAM usage and audio quality
- Cross-platform support for CUDA, CPU, and Apple Silicon (MPS) in standard ComfyUI setups
- Helper nodes for text-file loading, VRAM freeing, and multi-speaker text formatting with [N]: speaker labels
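
The [N]: labeling used for multi-speaker scripts can be illustrated with a small standalone sketch. This is not the project's own code: the function names `parse_script` and `format_script` are hypothetical. The exact rules are also assumptions, namely that a label must start the line, that N is a positive integer, and that unlabeled lines continue the previous turn.

```python
import re
from typing import List, Tuple

# Assumed label shape: "[N]: text" at the start of a line, N an integer.
_LABEL = re.compile(r"^\[(\d+)\]:\s*(.*)$")


def parse_script(script: str) -> List[Tuple[int, str]]:
    """Split a labeled script into (speaker_number, text) turns."""
    turns: List[Tuple[int, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = _LABEL.match(line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
        elif turns:
            # Assumption: an unlabeled line continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}".strip())
        else:
            raise ValueError(f"text before any speaker label: {line!r}")
    return turns


def format_script(turns: List[Tuple[int, str]]) -> str:
    """Render (speaker_number, text) turns back into [N]: labeled lines."""
    return "\n".join(f"[{speaker}]: {text}" for speaker, text in turns)


if __name__ == "__main__":
    demo = "[1]: Hello there.\n[2]: Hi!\nHow are you?"
    for speaker, text in parse_script(demo):
        print(speaker, text)
```

Parsing into a list of turns first makes it easy to check how many distinct speakers a script uses before matching them to voice samples.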