Open-source multi-speaker long-form text-to-speech model
Speech recognition module for Python
A nearly-live implementation of OpenAI's Whisper
Generate audiobooks from e-books, voice cloning & 1107+ languages
Generate audiobooks from EPUBs, PDFs and text with captions
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Speech-to-text, text-to-speech, and speaker recognition
A free, open-source, and extensible speech-to-text application
Stable Diffusion for real-time music generation (web app)
Instant voice cloning by MIT and MyShell. Audio foundation model
Capable of understanding text, audio, vision, and video
State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX
SOTA discrete acoustic codec models with 40/75 tokens per second
Synchronized Translation for Videos
48 kHz stereo neural audio codec for general audio
Free, high-quality text-to-speech API endpoint to replace OpenAI
Comprehensive Gradio WebUI for audio processing
CAPTCHA solver extension for humans
AI video generator optimized for low-VRAM and older GPUs
Framework for building real-time voice and multimodal AI agents
Fast multimodal LLM for real-time voice interaction and AI apps
Self-hosted AI audio transcription
A multimodal model for brain response prediction