SOTA discrete acoustic codec models with 40/75 tokens per second
Uses Qwen3-ASR, local LLM, Whisper, TEN-VAD
A nearly-live implementation of OpenAI's Whisper
AudioMuse-AI is an Open Source Dockerized environment
Automatic Speech Recognition with Word-level Timestamps
Comprehensive Gradio WebUI for audio processing
Open-source multi-speaker long-form text-to-speech model
AI video generator optimized for low VRAM and older GPUs
Open speech-to-speech models and pipelines by Hugging Face
SOTA Open Source TTS
A general fine-tuning kit geared toward image/video/audio diffusion
Generate audiobooks from EPUBs, PDFs and text with captions
Multimodal Diffusion with Representation Alignment
Chat & pretrained large audio language model proposed by Alibaba Cloud
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Generate blog articles from video or audio
State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX
Free, high-quality text-to-speech API endpoint to replace OpenAI
Capable of understanding text, audio, vision, video
MARS5 speech model (TTS) from CAMB.AI
Robust Speech Recognition via Large-Scale Weak Supervision
The most powerful and modular diffusion model GUI, API and backend
Fast multimodal LLM for real-time voice interaction and AI apps
Multilingual speech recognition and audio understanding model
Open source AI model for generating full songs from lyrics prompts