Framework for building real-time multimodal voice AI agent apps
GLM-4-Voice | End-to-End Chinese-English Conversational Model
VideoRAG: Chat with Your Videos
Context data platform for building observable, self-learning AI agents
Foundational model for human-like, expressive TTS
Official MiniMax Model Context Protocol (MCP) server
Build multimodal AI applications with cloud-native stack
Marrying Grounding DINO with Segment Anything & Stable Diffusion
Improve human sleep through scientifically
Software that uses AI to perform real-time voice conversion
An AI for Music Generation
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Meta-database application for the audio and TV broadcasts of CC2.TV
Data Lake for Deep Learning. Build, manage, and query datasets
Build cross-modal and multimodal applications on the cloud
Toolkit for audio, music, and speech generation
Towards Human-Level Text-to-Speech through Style Diffusion
High-quality multi-lingual text-to-speech library by MyShell.ai
A Conversational Speech Generation Model
SPPAS - the automatic annotation and analysis of speech
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
An extremely simple tool for separating vocals and background music
Open source implementation of Microsoft's VALL-E X zero-shot TTS model
Unofficial Parallel WaveGAN
The PyTorch-based audio source separation toolkit for researchers