Multilingual speech recognition and audio understanding model
Audio foundation model excelling in audio understanding
SOTA Open Source TTS
The official repo of the Qwen2-Audio chat & pretrained large audio language models
Open speech-to-speech models and pipelines by Hugging Face
Robust Speech Recognition via Large-Scale Weak Supervision
Speech recognition module for Python
A Lightweight Face Recognition and Facial Attribute Analysis Library
Open-source framework for intelligent speech interaction
An LLM-based audio-editing model trained with reinforcement learning
Open-source industrial-grade ASR models
kaldi-asr/kaldi is the official location of the Kaldi project
Towards Human-Sounding Speech
A PyTorch-based Speech Toolkit
A TTS model capable of generating ultra-realistic dialogue
Controllable & emotion-expressive zero-shot TTS
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Instant voice cloning by MIT and MyShell; an audio foundation model
StreamSpeech is a seamless model for offline and simultaneous speech recognition, speech translation, and speech synthesis
Multilingual Automatic Speech Recognition with word-level timestamps
Uses Qwen3-ASR, a local LLM, Whisper, and TEN-VAD
Interface for OuteTTS models
Toolkit for conversational AI
Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles
Fast multimodal LLM for real-time voice interaction and AI apps