Fast multimodal LLM for real-time voice interaction and AI apps
A nearly-live implementation of OpenAI's Whisper
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Taming Stable Diffusion for Lip Sync
Synchronized Translation for Videos
One-click deployment (including offline integration package)
AudioMuse-AI is an open-source Dockerized environment
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Instant voice cloning by MIT and MyShell. Audio foundation model
Open speech-to-speech models and pipelines from Hugging Face
SOTA discrete acoustic codec models with 40/75 tokens per second
SOTA Open Source TTS
AI video generator optimized for low VRAM and older GPUs
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Comprehensive Gradio WebUI for audio processing
Oobabooga - The definitive Web UI for local AI, with powerful features
Multimodal Diffusion with Representation Alignment
The official Python SDK for the ElevenLabs API
AI tool converting video/audio into structured documents instantly
Capable of understanding text, audio, vision, video
Data manipulation and transformation for audio signal processing
ComfyUI integration for Microsoft's VibeVoice text-to-speech model
Open source AI model for generating full songs from lyric prompts
A general fine-tuning kit geared toward image/video/audio diffusion
Robust Speech Recognition via Large-Scale Weak Supervision