Multimodal Diffusion with Representation Alignment
Code for openai.fm, a demo for the OpenAI Speech API
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Speech-to-text, text-to-speech, and speaker recognition
Generate audiobooks from e-books, voice cloning & 1107+ languages
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Clone a voice in 5 seconds to generate arbitrary speech in real-time
A free, open source, and extensible speech-to-text application
AudioMuse-AI is an open-source Dockerized environment
AI video generator optimized for low-VRAM and older GPUs
Stable Diffusion for real-time music generation (web app)
Captcha solver extension for humans
Comprehensive Gradio WebUI for audio processing
Synchronized Translation for Videos
Instant voice cloning by MIT and MyShell. Audio foundation model
Implementation of the AudioLM audio generation model in PyTorch
SOTA discrete acoustic codec models with 40/75 tokens per second
A multimodal model for brain response prediction
48 kHz stereo neural audio codec for general audio
A private, local meeting notes assistant
Free, high-quality text-to-speech API endpoint to replace OpenAI
Automatic Speech Recognition with Word-level Timestamps
Generate audiobooks from EPUBs, PDFs and text with captions
Fast multimodal LLM for real-time voice interaction and AI apps
Make videos programmatically with React