AudioMuse-AI is an Open Source Dockerized environment
Fast multimodal LLM for real-time voice interaction and AI apps
Taming Stable Diffusion for Lip Sync
Synchronized Translation for Videos
A nearly-live implementation of OpenAI's Whisper
Multimodal Diffusion with Representation Alignment
Open speech-to-speech models and pipelines by Hugging Face
Multilingual speech recognition and audio understanding model
Capable of understanding text, audio, vision, and video
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Instant voice-cloning audio foundation model by MIT and MyShell
Oobabooga - The definitive Web UI for local AI, with powerful features
Comprehensive Gradio WebUI for audio processing
SOTA discrete acoustic codec models with 40/75 tokens per second
Clone a voice in 5 seconds to generate arbitrary speech in real-time
SOTA Open Source TTS
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
AI video generator optimized for low VRAM and older GPUs
Free, high-quality text-to-speech API endpoint to replace OpenAI
Data manipulation and transformation for audio signal processing
Framework for building real-time voice and multimodal AI agents
Sample code and notebooks for Generative AI on Google Cloud
Automatically translates the text of a video based on a subtitle file
Interface for OuteTTS models
A Web UI for easy subtitle generation using the Whisper model