Multi-modal large language model designed for audio understanding
ComfyUI integration for Microsoft's VibeVoice text-to-speech model
Interface for OuteTTS models
Open-source multi-speaker long-form text-to-speech model
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Automatic Speech Recognition with Word-level Timestamps
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
A Web UI for easy subtitle generation using the Whisper model
One-click deployment (including offline integration package)
Official PyTorch Implementation
An open-source implementation of NotebookLM with more flexibility
Synchronized Translation for Videos
Instant voice cloning by MIT and MyShell. Audio foundation model
High-Quality Voice Cloning TTS for 600+ Languages
MARS5 speech model (TTS) from CAMB.AI
A Python library for audio
Speech recognition module for Python
Audiocraft is a library for audio processing and generation
Foundational model for human-like, expressive TTS
The most powerful and modular diffusion model GUI, API, and backend
Towards Human-Level Text-to-Speech through Style Diffusion
The official Python Library for the Groq API
Build AI-powered semantic search applications
Edit videos with Claude Code
The data structure for multimodal data